Staff Software Engineer I - SRE

Site Reliability Engineer · Staff · Full Time

Remote · India · 5mo ago

Role

What you'll do.

Confluent is seeking a Staff Software Engineer I - SRE to drive proactive reliability improvements across its multi-cloud streaming platform, which processes millions of events per second. The role combines roughly 75% hands-on engineering work (building automation, tooling, and reliability systems) with roughly 25% incident management program leadership and cross-team coaching. This position requires 10+ years of experience in SRE, incident management, or reliability engineering, along with strong knowledge of distributed systems, cloud platforms, and incident management tooling.

Responsibilities

Proactive Reliability Engineering: Analyze systemic failure patterns and design improvements that prevent incident recurrence across the multi-cloud streaming platform
SLO/SLA Framework Management: Define and maintain Service Level Objectives and Agreements, using error budgets to guide reliability investments
Automation and Tooling Development: Build tooling and automation to reduce incident response toil and scale team impact across the engineering organization
Incident Management Platform Ownership: Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Reliability Data Analysis: Analyze reliability data to identify systemic improvements and build dashboards that drive actionable insights
AI-Assisted Incident Analysis: Explore AI-assisted approaches to documentation quality and incident analysis for improved response times
Incident Response Program Leadership: Own standards, practices, and continuous improvement of incident response across global engineering teams
Incident Commander Training: Define incident commander eligibility criteria, manage rotation, and develop training programs for engineering teams
Escalation Management: Serve as escalation incident commander when incidents exceed a team's management chain
Post-Mortem Facilitation: Coach teams through post-mortems and develop actionable corrective actions to prevent recurrence
Customer Root Cause Analysis: Edit and review customer-facing incident documents to ensure quality, clarity, and technical accuracy
Cross-Team Reliability Leadership: Partner with engineering leaders to elevate reliability practices and serve as a trusted advisor

Qualifications

What we look for.

Technical

SRE Experience
10+ years in Site Reliability Engineering, incident management, or reliability engineering
Cloud Platform Expertise
Deep experience with at least one of AWS, GCP, or Azure cloud platforms
Incident Management Tools
Deep expertise with incident management tooling such as Rootly, PagerDuty, or similar platforms
Distributed Systems Knowledge
Strong understanding of distributed systems and failure modes at scale
Observability Expertise
Deep experience with observability: metrics, logging, tracing, and ability to diagnose complex issues
Container Orchestration
Kubernetes and container orchestration experience
CI/CD Pipeline Understanding
Understanding of CI/CD pipelines and release processes
Systems Thinking
Understanding how infrastructure design choices affect failure modes and recovery
SLO/SLA Framework Knowledge
Familiarity with Service Level Objectives and Service Level Agreements frameworks

Education

Bachelor's Degree
Bachelor's degree in Computer Science, Engineering, or related technical field preferred
Advanced Technical Certifications
Cloud platform certifications (AWS, GCP, Azure) and SRE-related certifications preferred

Experience

Large Organization Experience
Large company experience navigating reliability/incident programs at 500+ engineer organizations
Cross-Organizational Leadership
Track record as a trusted advisor across engineering organizations
Process and Cultural Change
Experience driving organization-wide process and cultural changes
Technical Communication
Strong written communication skills for design docs, one-pagers, and runbooks
Post-Mortem Facilitation
Demonstrated experience with post-mortem facilitation and incident analysis
Async Collaboration
Experience with asynchronous collaboration across multiple time zones

Skills

Required

Site Reliability Engineering
10+ years of hands-on SRE experience with a focus on proactive reliability improvements
Cloud Platforms
Deep expertise in at least one major cloud platform (AWS, GCP, or Azure)
Incident Management
Expert-level experience with incident management tools such as Rootly or PagerDuty
Distributed Systems
Strong understanding of distributed systems architecture and failure modes at scale
Observability
Deep experience with metrics, logging, tracing, and complex issue diagnosis
Kubernetes
Container orchestration experience for multi-cloud deployments
Systems Thinking
Understanding of how infrastructure design affects failure modes and recovery
Leadership
Track record as a trusted advisor driving organization-wide process improvements
Technical Communication
Strong written communication for design docs, runbooks, and incident documentation

Preferred

Apache Kafka
Experience with Kafka or event streaming platforms, or demonstrated rapid mastery of complex systems
Multi-Cloud Experience
Experience with 2+ cloud platforms (AWS, GCP, Azure) for enhanced reliability
AI-Assisted Workflows
Experience with modern CI/CD, GitHub, and AI-assisted workflows for automation
Large-Scale Organizations
Experience in organizations with 500+ engineers and complex reliability programs
Post-Mortem Facilitation
Advanced skills in facilitating effective post-mortems and developing actionable corrective actions

Tech stack

Languages

Python, Go, Java

Frameworks

Apache Kafka, Kubernetes

Databases

Apache Kafka, ClickHouse, PostgreSQL

Tools

Rootly, PagerDuty, Grafana, Prometheus, Terraform, GitHub, Jira, Confluence, Slack

Other

AWS, GCP, Azure, OpenTelemetry, Docker

Compensation

Pay and benefits.

Benefits

Equal Opportunity Workplace
Employment decisions based on job-related criteria without regard to protected classifications
Inclusive Culture
Belonging-focused workplace with emphasis on diverse perspectives and leadership opportunities
Global Team Collaboration
Follow-the-sun coverage with clean handoffs maintaining sustainable work hours
Professional Growth
Opportunities to lead cross-team initiatives and drive organization-wide improvements
Remote Work Flexibility
Remote work opportunity in India with global team collaboration

Process

Interview steps.

01
Initial Screening
Phone or video screening with recruiter covering background, experience, and role expectations
02
Technical Deep Dive
Technical interview focusing on SRE experience, incident management, and distributed systems knowledge
03
System Design
System design interview covering reliability architecture, failure modes, and scalability patterns
04
Behavioral Interview
Leadership and collaboration assessment focusing on cross-team influence and process improvement experience
05
Final Round
Panel interview with engineering leadership covering strategic thinking and cultural fit

Full posting

Original listing.

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.

About the Role:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale—data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.

You'll be part of a global team with follow-the-sun coverage, with clean handoffs that keep everyone working sustainable hours. Confluent has 800-1000 engineers across highly autonomous teams. This role sits within Cloud Architecture and Reliability - Supportability (CAR-S), a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

What You Will Do:

Proactive Reliability Engineering (~75% of role):
Analyze systemic failure patterns and design improvements that prevent incident recurrence
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Build tooling and automation to reduce incident response toil and scale team impact
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Analyze reliability data to identify systemic improvements; build dashboards that drive action
Explore AI-assisted approaches to documentation quality and incident analysis
Design scalable reliability standards that reduce reactive workload over time

Incident Management Program (~25% of role):
Own standards, practices, and continuous improvement of incident response
Define incident commander eligibility criteria and manage the rotation
Serve as escalation IC when incidents exceed a team's management chain
Develop and deliver training programs for engineering teams at all levels
Coach teams through post-mortems and on developing actionable corrective actions

Customer Root Cause Analysis (CRCA):
Edit and review customer-facing incident documents to ensure quality and clarity
Drive turnaround SLAs while maintaining technical accuracy
Ensure clear explanation of what happened, why, and how we'll prevent recurrence

Cross-Team Leadership:
Partner with engineering leaders to elevate reliability practices
Be the expert who teams proactively engage for guidance

What You Will Bring:

10+ years in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar platforms)
Strong understanding of distributed systems and failure modes at scale; Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
Deep experience with observability (metrics, logging, tracing) and the ability to diagnose complex issues
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Systems thinking: understanding how infrastructure design choices affect failure modes and recovery
Familiarity with SLO/SLA frameworks
Track record as a trusted advisor across engineering organizations
Experience driving org-wide process and cultural changes
Strong written communication (design docs, one-pagers, runbooks)
Post-mortem facilitation experience
Experience with async collaboration across time zones
Large company experience navigating reliability/incident programs at 500+ engineer organizations

What Gives You an Edge:

Multi-cloud experience (at least two of AWS/GCP/Azure)
Modern CI/CD, GitHub, and AI-assisted workflows; you'll have the freedom to build what you need

Ready to build what's next? Let’s get in motion.

Come As You Are

Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.
