Description:
Collaborates with Agile squads and developers, sustain teams, and business partners, and contributes significantly to specifications that resolve problems and address enhancement needs, focusing on logging, monitoring, and metrics for operational readiness.
- Use technical knowledge, creativity, and company practices to reduce incidents by developing proactive monitoring and alerting.
- Respond to incidents in accordance with Service Level Agreements (SLAs).
- Provide continuous feedback to development teams on system stability, defect analysis, and system enhancements.
- Work with IT business and development partners to gather input for new capabilities in displaying, monitoring, and alerting on key performance indicators (KPIs) by tracking business transactions (BTs) in real time.
- Take ownership of and accountability for the incident resolution process, participating in root cause analysis (RCA) and SWAT investigations.
- Plan the validation and verification of changes deployed by infrastructure and development teams.
- Participate in day-to-day real-time technical support and troubleshooting on issues reported from the user/customer base.
- Establish and maintain good relationships with team members, Product Development, Product Management, Customer Service, Client Management, and other cross-functional teams.
- Participate in training and information-sharing activities.
- Act as backup for other team members when necessary.
- Requires rotating shift work as needed.
- Participation in the on-call rotation is required, as support is provided 24x7x365.
What It Takes
- Deep understanding of Linux systems
- Hands-on experience with cloud infrastructure; Google Cloud, AWS, or Azure is a plus.
- Experience with PaaS technologies such as Cloud Foundry, Kubernetes, and BOSH.
- Experience with continuous delivery tools such as Ansible, Rundeck, or Argo CD to set up automated pipelines as needed.
- Experience in supporting middleware technologies such as Apache, Tomcat, and Spring.
- Experience with at least one scripting language such as shell, Perl, Python, or JavaScript.
- Experience with installing and configuring Apache and Tomcat.
- Deep expertise in monitoring distributed application architectures and the ability to correlate environment conditions and metrics with application events.
- Experience with APM tools such as New Relic, Dynatrace, or AppDynamics.
- Experience with monitoring tools such as Zabbix or check_mk.
- Strong understanding of ITIL principles; certification is a plus.
- Proven problem-solving and analytical ability.
- Excellent organizational/time management skills.
- A proven record of being able to work independently and collaboratively.