Description:
We are seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) with a strong focus on Azure. As a Senior SRE, you will play a critical role in driving the security, availability, fault tolerance, and automation of our service. You will be responsible for implementing monitoring, alerts, and automation to make the service more resilient. Your expertise in Azure will be essential in integrating security checks throughout the entire CI/CD process to ensure the service remains secure from inception to production.
Requirements:
- Bachelor's or Master's degree in Computer Science, Computer Engineering, Software Engineering, or equivalent.
- First-class analytical, diagnostic, and problem-solving skills.
- Extensive experience operating, scaling, and automating SaaS applications.
- Experience with production incident management and postmortems.
- Strong proficiency in using Azure.
- Proficiency in scripting and defining infrastructure as code.
- Experience in defining and implementing service monitoring and alerts.
- Passion for working in an exciting environment and delivering new technologies and products.
- Strong belief in DevSecOps and SRE cultures with an operational mindset.
- Experience implementing compliance requirements into a service.
- Ability to learn quickly.
- Excellent verbal and written communication skills, with the ability to collaborate effectively with cross-functional teams.
- Eligibility to work in Canada.
Bonus Points:
- Experience in cybersecurity.
- Experience using multiple public cloud providers.
- Familiarity with monitoring services such as Datadog or Sumo Logic.
- Knowledge of cloud-native applications using the 12-factor methodology.
- Experience with event-driven architectures.
- Familiarity with Agile development approaches such as Scrum, Kanban, or SAFe.
- Proficiency in creating and consuming RESTful APIs.
Responsibilities:
- Driving the service toward increased security, higher availability, fault tolerance, and automation.
- Implementing monitoring, alerts, and automation to make the service more resilient.
- Integrating security checks throughout the entire CI/CD process to keep the service secure from inception to production.
- Creating dashboards and alerts to monitor the service.
- Defining service KPIs and alerting on them.
- Defining and implementing operational, compliance, and security improvement plans.
- Enabling the service to scale further and operate in additional regions while meeting compliance and legal requirements.
- Participating in the on-call rotation with other team members.
- Running operations reviews and incident postmortems.
- Managing and implementing the disaster recovery plan.
- Providing feedback on development and design plans regarding their impact on the performance, operations, scalability, and security of the overall service.
- Communicating and coordinating with product security and IT.