Description:
As a central figure in our SRE team, you will report directly to the Director of Site Reliability Engineering and play a crucial role in aligning IT service management with our broader organizational goals.
Key Responsibilities:
Incident Management:
- Lead the response to IT incidents, ensuring timely and effective resolution. Coordinate across teams to minimize impact and restore service swiftly.
- Develop and refine incident response protocols, ensuring they align with business needs and industry best practices.
Problem Management:
- Proactively identify and analyze recurring IT issues. Work with teams to implement long-term solutions that prevent future incidents and enhance system reliability.
- Collaborate with technical teams to understand root causes and track problem resolution progress.
Change Management:
- Oversee the IT change management process, ensuring all changes are assessed, approved, implemented, and reviewed in a controlled manner.
- Facilitate change advisory board meetings to evaluate the impact of proposed changes and make informed decisions.
Vendor Management:
- Manage relationships with IT service vendors, ensuring that renewals and cancellations happen within the appropriate time windows.
- Evaluate vendor performance regularly and work with procurement to negotiate terms that align with organizational objectives and IT strategies.
Business Continuity Planning:
- Review and maintain comprehensive business continuity plans and procedures. Ensure these are up to date and aligned with organizational risk tolerance.
- Conduct regular business impact analyses and lead drills to test and refine business continuity strategies.