
SRE Program Manager

Pune, Job No. 14368373, Full-time - Remote

Job Description

Program Manager - Site Reliability Engineering (Cloud Native Platform Team)

Role Summary

The Program Manager will drive day-to-day operations of the Site Reliability Engineering (SRE) team, ensuring alignment with organizational goals for reliability, scalability, and operational excellence. This role requires a strong technical background in SRE practices and proven program management expertise to drive cross-functional initiatives, optimize processes, and deliver measurable business and operations value.

Key Responsibilities

1. Operational Leadership
o Drive adoption of SRE best practices such as error budgets, SLIs/SLOs, and automation to reduce toil.
o Ensure compliance with security, privacy, and regulatory standards in all reliability initiatives.

2. Program Management
o Define program scope, objectives, and success criteria for reliability initiatives.
o Develop and maintain quarterly roadmaps for SRE projects in collaboration with platform engineering teams.
o Track progress, risks, and dependencies across multiple projects using tools like JIRA and Confluence.
o Facilitate communication between SRE, development, and leadership teams to ensure transparency and alignment.

3. Performance Measurement
o Establish and monitor KPIs for reliability and operational efficiency.
o Prepare executive dashboards and reports to translate technical metrics into business impact narratives.
o Lead continuous improvement initiatives based on data-driven insights.

4. Stakeholder Engagement
o Act as the primary liaison between SRE and other teams (Product, Engineering and Delivery-SOC).
o Influence decision-making at all levels through clear communication and structured reporting.

Performance Measurement Parameters

Incident Metrics:
o Mean Time to Detect (MTTD)
o Mean Time to Respond (MTTR)
o Mean Time to Recovery (MTTR)
o Incident Frequency and Severity

Change Management:
o Change Failure Rate
o Change Success Rate

Reliability Metrics:
o System Uptime / Availability
o Service Level Objective Achievement Percentage

Operational Efficiency:
o Automation Rate
o On-call Burden Reduction

Measurement Matrix for Leadership Presentation

Use a dashboard approach combining:
o Latency, Traffic, Errors, Saturation
o Monthly/Quarterly trends on SLO
o Incident Heatmaps: Highlighting root causes and resolution times.
o Business Impact Metrics: Cost savings, risk reduction, and ROI from reliability improvements.

Tools: Datadog

Experience Requirements

Technical Background:
o Prior hands-on experience as a Site Reliability Engineer or in DevOps roles.
o Strong understanding of cloud-native architectures (Kubernetes, microservices, distributed systems).

Program Management Expertise:
o 5+ years in program or technical project management.
o Proven ability to manage cross-functional initiatives in fast-paced environments.
o Familiarity with Agile methodologies and tools (JIRA, Confluence).

Leadership & Communication:
o Experience presenting technical and operational metrics to executive leadership.
o Strong stakeholder management and negotiation skills.

Certifications: PMP, SAFe, or SRE Foundation (SAFe preferred)
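
For candidates who want a concrete picture of the incident, change, and SLO metrics named under Performance Measurement Parameters, the following is a minimal Python sketch of how such figures might be computed. Everything in it is an illustrative assumption rather than part of the role's actual tooling: the incident records are made up, the function names are hypothetical, and "Mean Time to Recovery" is assumed here to run from incident start to resolution (teams define it differently).

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, resolved).
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 15), datetime(2024, 1, 8, 14, 45)),
]


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


# Mean Time to Detect: incident start until detection.
mttd_min = mean_minutes([detected - started for started, detected, _ in incidents])

# Mean Time to Recovery: assumed here to be incident start until resolution.
mttr_recovery_min = mean_minutes([resolved - started for started, _, resolved in incidents])


def change_failure_rate(failed_changes, total_changes):
    """Fraction of deployed changes that caused a failure."""
    return failed_changes / total_changes


def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent for the period.

    The budget is the number of bad events the SLO tolerates,
    i.e. (1 - slo_target) * total_events.
    """
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


print(f"MTTD: {mttd_min} min")                 # 10.0 for the sample data
print(f"MTTR (recovery): {mttr_recovery_min} min")  # 52.5 for the sample data
print(f"Change failure rate: {change_failure_rate(3, 40):.1%}")
print(f"Error budget left: {error_budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

With a 99.9% SLO over one million requests, the budget tolerates about 1,000 failures, so 500 failed requests leaves roughly half the budget unspent. Figures like these are the kind of input an executive dashboard would trend monthly or quarterly.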
