Site Reliability Engineer

 

Description:

The Instant Financial Issuance (IFI) Cloud Service comprises a wide array of components, including web services, application servers, and databases, hosted in a hybrid cloud environment. The Site Reliability Engineer (SRE) will be responsible for ensuring that the SaaS platform is reliable, available, performant, scalable, secure, and cost-effective. Ultimately, the individual will be responsible for the functional management of all IFIaaS cloud environments, applications, and networks, for scoping projects, and for resolving application and network issues.

Responsibilities:

  • Monitor system health using metrics such as uptime, latency, error rate, throughput, and availability

  • Deploy and maintain monitoring and on-call tools (e.g., Splunk, Prometheus, Grafana, PagerDuty, Datadog)

  • Create strategies to detect issues, such as setting up alerts, dashboards, and health checks

  • Address issues as they arise, using troubleshooting techniques, root cause analysis, and incident management.

  • Design systems that recover automatically, using self-healing mechanisms such as auto-scaling, load balancing, failover, and mitigation runbooks

  • Collaborate with development teams and other stakeholders to identify potential risks, such as security vulnerabilities, performance bottlenecks, deployment issues, or configuration errors

  • Implement risk mitigation strategies such as patching, backup, redundancy, encryption, or testing

  • Design, build, and maintain robust infrastructure on Azure and AWS, leveraging cloud-native technologies (e.g., AKS, EKS, managed SQL, MongoDB)

  • Define and follow a clear incident response process, which includes roles, responsibilities, escalation, communication, and resolution

  • Use automation and orchestration tools to speed up the recovery process, such as restoring backups, rolling back changes, or deploying fixes

  • Design, implement, and maintain robust CI/CD pipelines to automate the software delivery process

  • Automate configuration management tasks across multiple servers in hybrid cloud environments using tools such as Ansible and Terraform

  • Define infrastructure as code (IaC) to provision and manage resources in hybrid environments (Azure, AWS, on-premises), including complete lifecycle management, scaling, and decommissioning.

  • Implement best practices and standards to prevent or reduce the occurrence of emergencies, such as code reviews, testing, and monitoring.

  • Implement and support a hybrid cloud environment spanning Microsoft Azure and on-premises infrastructure

  • Update incident response runbooks and automation, and create new templates as required

  • Manage activities with complete integrity and in accordance with the organization's policies, systems, practices, and programs

  • Collaborate with product teams and other teams to understand user needs, expectations, and satisfaction.

  • Learn from incidents and post-mortems, and implement the resulting action items to prevent recurrence or improve response.

  • Suggest and implement new solutions and technologies to enhance the system and the service, such as optimization, automation, or innovation.

  • Provide after-hours support for production issues on a rotational basis with other team members to ensure system availability 24/7/365.

Basic Qualifications:

  • Bachelor’s Degree in Computer Science, Software Engineering, or equivalent combination of education and experience

  • 5+ years of related experience as a Software Engineer, DevOps Engineer, Site Reliability Engineer, or in a similar role

  • Extensive experience working with enterprise-level microservices applications, including deploying and maintaining them in distributed environments.

  • Demonstrated hands-on experience and expertise with DevOps tooling (Ansible, Terraform, Jenkins, Octopus Deploy, etc.), networks, and network security, along with high-level managerial skills

  • In-depth hands-on experience with on-premises and cloud compute, storage, and networking solutions (VMware, NetApp, Azure, AWS, etc.)

Organization: Entrust Datacard
Industry: Engineering Jobs
Occupational Category: Site Reliability Engineer
Job Location: Ottawa, Canada
Shift Type: Morning
Job Type: Full Time
Gender: No Preference
Career Level: Experienced Professional
Experience: 5 Years
Posted at: 2025-03-11 6:04 am
Expires on: 2025-04-25