Job Description
Experience Required (must have): 15 years overall, including 5–10 years in Program Management
Note: This is a hybrid role based in Miami, FL (3 days onsite and 2 days remote). Please focus on candidates who are local to Miami or willing to relocate at their own expense to work from Miami in the required hybrid model.
We are looking for a highly experienced SRE Program Manager with a solid software development background. The ideal candidate will have 15+ years of total professional experience, including 5–10 years in program management.
This role requires strong communication skills and a proven ability to lead and manage multiple teams and to deliver high-impact programs that enhance reliability, scalability, and operational excellence.
The ideal candidate will combine program management skills with hands-on SRE knowledge to ensure high availability, performance, and reliability of our services while driving operational excellence.
Key Responsibilities:
Program Management & Leadership
Lead and manage complex operational support programs, ensuring alignment with business objectives and SLAs.
Coordinate cross-functional teams (engineering, operations, support) to deliver seamless production support.
Drive program planning, execution, risk management, and reporting.
Operations & Reliability
Establish and monitor SLOs, SLIs, and SLAs to maintain service reliability and uptime.
Implement incident management, root cause analysis, and post-mortem reviews to drive continuous improvement.
Define and track key operational metrics; ensure compliance with reliability and performance goals.
Drive operational efficiency through process optimization, change management, and strategic planning.
Incident Management - Own major incident response and postmortem processes; ensure root cause analysis and long-term resolutions.
Platform Operations - Oversee cloud infrastructure, CI/CD pipelines, observability, tooling, and configuration management.
SRE Practices
Apply SRE principles to optimize system reliability, scalability, and efficiency.
Automate operational tasks and processes (CI/CD, monitoring, alerting, runbooks, etc.).
Partner with engineering teams to embed reliability into system design and deployment.
Next-Gen Capabilities - Introduce AI, GenAI, and Agentic AI capabilities to improve SRE KPIs.
Stakeholder Management
Act as the primary liaison between business units and technical teams for all operations-related initiatives.
Provide transparent communication on program progress, incidents, risks, and resolutions to senior leadership.
Build strong relationships with internal and external stakeholders to drive collaboration and accountability.
Strategic Leadership: Define and drive the SRE and Infrastructure strategy with business and IT stakeholders.
Cross-Functional Initiatives: Drive cross-functional strategic initiatives and run programs from ideation through execution.
Collaboration: Collaborate with other teams to align program deliverables and success metrics.
Security and Compliance: Collaborate with InfoSec teams and identify efficiency opportunities across systems and services.
Reliability and Performance: Lead initiatives to improve the availability, latency, and efficiency of services.
Qualifications:
Minimum 15 years of relevant experience in SRE, DevOps, or Infrastructure Engineering.
Familiarity with SRE principles and key SRE currencies defined by Google or a similar framework.
Minimum 5 years of leadership experience managing technical teams in a 24x7 model.
Deep expertise in cloud computing, container orchestration, automation, and observability.
Strong understanding of modern software delivery practices (CI/CD and GitOps).
Proven track record in incident response, system architecture, and operational excellence.
Job Tags
Local area, Relocation