Site Reliability Engineer

Description:

The Instant Financial Issuance (IFI) Cloud Service comprises a wide array of components, including web services, application servers, and databases, hosted in a hybrid cloud environment. The Site Reliability Engineer (SRE) will be responsible for ensuring that the SaaS platform is reliable, available, and performant, as well as scalable, secure, and cost-effective. Ultimately, the individual will be responsible for the functional management of all IFIaaS cloud environments, applications, and networks, for scoping projects, and for resolving application and network issues.

Responsibilities:

Monitor system health using metrics such as uptime, latency, error rate, throughput, and availability
Deploy and maintain monitoring and on-call tools such as Splunk, Prometheus, Grafana, PagerDuty, and Datadog
Create strategies to detect issues, such as setting up alerts, dashboards, and health checks
Address issues as they arise, using troubleshooting techniques, root cause analysis, and incident management.
Design systems that troubleshoot automatically, using self-healing mechanisms such as auto-scaling, load balancing, failover, and mitigation runbooks
Collaborate with development teams and other stakeholders to identify potential risks, such as security vulnerabilities, performance bottlenecks, deployment issues, or configuration errors
Implement various risk mitigation strategies, such as patching, backup, redundancy, encryption, or testing
Design, build, and maintain robust infrastructure on Azure and AWS, leveraging native cloud technologies such as AKS, EKS, managed SQL, and Mongo
Define and follow a clear incident response process, which includes roles, responsibilities, escalation, communication, and resolution
Use automation and orchestration tools to speed up the recovery process, such as restoring backups, rolling back changes, or deploying fixes
Design, implement, and maintain robust CI/CD pipelines to automate the software delivery process
Automate configuration management tasks across multiple servers in hybrid cloud environments using tools such as Ansible and Terraform
Define infrastructure as code (IaC) to provision and manage cloud resources in hybrid environments (Azure, AWS, on-prem), including complete lifecycle management, scaling, and decommissioning.
Implement best practices and standards to prevent or reduce the occurrence of emergencies, such as code reviews, testing, and monitoring.
Implement and support a hybrid cloud environment in Microsoft Azure and on-premises
Update incident response runbooks and automation, and create new templates as required
Manage activities with complete integrity and in accordance with the organization's policies, systems, practices, and programs
Collaborate with product teams and other teams to understand user needs, expectations, and satisfaction.
Learn from incidents and post-mortems and implement the action items to prevent recurrence or improve response.
Suggest and implement new solutions and technologies to enhance the system and the service, such as optimization, automation, or innovation.
Provide after-hours support for production issues on a rotational basis with other team members to ensure system availability 24/7/365.

Basic Qualifications:

Bachelor’s Degree in Computer Science, Software Engineering, or equivalent combination of education and experience
5+ years of related experience as a Software Engineer, DevOps Engineer, Site Reliability Engineer, or in a similar role
Extensive experience working with enterprise-level microservices applications, including deployment and maintenance of the applications in distributed environments.
Demonstrated hands-on experience and expertise with DevOps tooling (Ansible, Terraform, Jenkins, Octopus Deploy, etc.), networks, and network security, along with high-level managerial skills
In-depth hands-on experience with on-prem and cloud compute, storage, and networking solutions (VMware, NetApp, Azure, AWS, etc.)

Organization	Entrust Datacard
Industry	Engineering Jobs
Occupational Category	Site Reliability Engineer
Job Location	Ottawa, Canada
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Experienced Professional
Experience	5 Years
Posted at	2025-03-11 6:04 am
Expires on	2025-04-25