Principal Site Reliability Engineer

Description:

Collaborates with Agile squads/developers, sustain, and business partners, and contributes significantly to developing specifications that resolve problems and address enhancement needs, focusing on logging, monitoring, and metrics for operational readiness.

Uses technical knowledge, creativity, and company practices to drive down occurrences of incidents through the development of proactive monitoring and alerting.
Respond to incidents in accordance with Service Level Agreements.
Provide continuous feedback to development teams on system stability, defect analysis, and system enhancements.
Work with IT business and development partners to gather input for new capabilities in displaying, monitoring, and alerting on key performance indicators (KPIs) by tracking business transactions (BT) in real time.
Take ownership and accountability for the incident resolution process, participating in RCA and SWAT investigations.
Plan for validation and verification of changes deployed by infrastructure and development teams.
Participate in day-to-day real-time technical support and troubleshooting on issues reported from the user/customer base.
Establish and maintain good relationships with team members, Product Development, Product Management, Customer Service, Client Management, and other cross-functional teams.
Participate in training and information-sharing activities.
Act as backup for other team members when necessary.
Requires rotating shift work as needed.
On-call rotation is required to provide 24x7x365 support.

What It Takes

Deep understanding of Linux systems.
Hands-on experience with cloud infrastructure; Google, AWS, or Azure a plus.
Experience with PaaS technologies such as Cloud Foundry, Kubernetes, and BOSH.
Experience with continuous delivery tools such as Ansible, Rundeck, or Argo CD to set up automated pipelines as needed.
Experience in supporting middleware technologies such as Apache, Tomcat, and Spring.
Experience with at least one scripting language such as shell, Perl, Python, or JavaScript.
Experience with installing and configuring Apache and Tomcat.
Deep expertise in monitoring distributed systems and application architectures, and the ability to correlate environment conditions and metrics to application events.
Experience with APM tools such as New Relic, Dynatrace, or AppDynamics.
Experience with monitoring tools such as Zabbix or check_mk.
Strong understanding of ITIL principles; certification is a plus.
Proven problem-solving and analytical ability.
Excellent organizational/time management skills.
A proven record of being able to work independently and collaboratively.

Organization	OpenText
Industry	Engineering Jobs
Occupational Category	Principal Site Reliability Engineer
Job Location	Toronto, Canada
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Intermediate
Experience	2 Years
Posted at	2023-09-05 11:29 am
Expires on	2025-03-02