Site Reliability Engineer

Description:

The Site Reliability Engineer (SRE) will play a critical role in maintaining and improving the reliability, scalability, and performance of our SaaS infrastructure and products. You will work closely with software engineers, product teams, and other stakeholders to design, build, and maintain systems that can handle the demands of a growing customer base. You will focus on automating processes and continuously improving the availability and performance of our services.

Reliability & Availability: Ensure the high availability, reliability, and performance of one or more SaaS products across production and staging environments. Monitor system health, track key performance indicators, and respond to incidents quickly to minimize downtime.
Incident Management: Perform incident response, troubleshooting, and post-mortem analysis for production incidents. Work to minimize the impact of incidents and drive improvements based on findings.
Automation & Efficiency: Implement automation for routine tasks like deployments, scaling, and maintenance. Develop tools and scripts that improve the operational and cost efficiency of the infrastructure.
Change Management: Work closely with engineering, product, and operations teams to design, deploy, and maintain cloud-based infrastructure and applications. Ensure that new releases and updates are deployed smoothly with minimal disruption.
Monitoring & Alerting: Build and maintain robust monitoring, alerting, and logging systems to provide real-time visibility into the health of our services. Analyze and act upon monitoring data, including availability, performance, and error logs, to proactively detect and resolve issues.
Capacity Planning & Scalability: Monitor system capacity, forecast growth, and ensure that our SaaS platforms scale appropriately to handle increased traffic and load. Design and implement strategies for capacity management.
Security & Compliance: Ensure that security best practices are followed for all infrastructure components. Collaborate with security teams to implement security controls, auditing, and compliance measures.
Performance Optimization: Continuously optimize the performance of our systems and applications by identifying and addressing bottlenecks and improving overall system throughput.
Documentation & Knowledge Sharing: Document systems, processes, and procedures. Foster a culture of knowledge sharing and collaboration across teams to improve operational understanding and best practices.

What You’ll Need:

3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role within a SaaS company or cloud environment.
Strong experience with cloud platforms (AWS) and infrastructure-as-code tools like Terraform, CloudFormation, or similar.
Experience with containerization technologies (Docker, Kubernetes) and orchestration platforms.
Experience with application performance monitoring (APM) and log analytics tools (e.g., ELK, Datadog, New Relic).
Proficiency in programming/scripting languages (Python, Bash, etc.).
Familiarity with CI/CD pipelines and automation tools.
Understanding of web application deployment and hosting fundamentals.
Understanding of database management and performance tuning.
Knowledge of networking fundamentals and web services (HTTP, DNS, load balancing, web application firewalls, etc.).
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
Strong analytical and troubleshooting skills with the ability to identify and resolve complex technical issues in distributed systems.
Excellent communication skills, with the ability to explain complex technical concepts to both technical and non-technical stakeholders.
Must be legally eligible to work in your country of residence, which must be either the continental US or Canada.

Preferred Qualifications:

AWS Certified Solutions Architect or similar professional certification.
Experience managing and maintaining large-scale distributed systems.
Experience with security best practices in cloud environments and SaaS platforms.

Organization	Foundant Technologies Inc
Industry	Engineering Jobs
Occupational Category	Site Reliability Engineer
Job Location	Toronto, Canada
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Intermediate
Experience	3 Years
Posted at	2025-04-04 9:46 am
Expires on	2025-05-19