Confluent

Staff Software Engineer I - SRE

Location: Remote, India
Type: Full Time
Level: Staff
Role: Site Reliability Engineer
Posted: Feb 16, 2026

The role

Summary

Confluent is seeking a Staff Software Engineer I - SRE to drive proactive reliability improvements across its multi-cloud streaming platform, which processes millions of events per second. The role is 75% hands-on engineering (building automation, tooling, and reliability systems) and 25% incident management program leadership and cross-team coaching. It requires 10+ years of SRE experience and deep expertise in distributed systems, cloud platforms, and incident management tooling.

What you'll do

Proactive Reliability Engineering: Analyze systemic failure patterns and design improvements that prevent incident recurrence across the multi-cloud streaming platform
SLO/SLA Framework Management: Define and maintain Service Level Objectives and Agreements, using error budgets to guide reliability investments
Automation and Tooling Development: Build tooling and automation to reduce incident response toil and scale the team's impact across the engineering organization
Incident Management Platform Ownership: Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Reliability Data Analysis: Analyze reliability data to identify systemic improvements and build dashboards that drive actionable insights
AI-Assisted Incident Analysis: Explore AI-assisted approaches to documentation quality and incident analysis for improved response times
Incident Response Program Leadership: Own standards, practices, and continuous improvement of incident response across global engineering teams
Incident Commander Training: Define incident commander eligibility criteria, manage rotation, and develop training programs for engineering teams
Escalation Management: Serve as escalation incident commander when an incident exceeds what the owning team's management chain can handle
Post-Mortem Facilitation: Coach teams through post-mortems and develop actionable corrective actions to prevent recurrence
Customer Root Cause Analysis: Edit and review customer-facing incident documents for quality, clarity, and technical accuracy
Cross-Team Reliability Leadership: Partner with engineering leaders to elevate reliability practices and serve as a trusted advisor

What we look for

Technical

SRE Experience: 10+ years in Site Reliability Engineering, incident management, or reliability engineering
Cloud Platform Expertise: Deep experience with at least one of AWS, GCP, or Azure cloud platforms
Incident Management Tools: Deep expertise with incident management tooling such as Rootly, PagerDuty, or similar platforms
Distributed Systems Knowledge: Strong understanding of distributed systems and failure modes at scale
Observability Expertise: Deep experience with observability: metrics, logging, tracing, and ability to diagnose complex issues
Container Orchestration: Kubernetes and container orchestration experience
CI/CD Pipeline Understanding: Understanding of CI/CD pipelines and release processes
Systems Thinking: Understanding how infrastructure design choices affect failure modes and recovery
SLO/SLA Framework Knowledge: Familiarity with Service Level Objectives and Service Level Agreements frameworks

Education

Bachelor's Degree: Bachelor's degree in Computer Science, Engineering, or related technical field preferred
Advanced Technical Certifications: Cloud platform certifications (AWS, GCP, Azure) and SRE-related certifications preferred

Experience

Large Organization Experience: Experience navigating reliability and incident programs at organizations with 500+ engineers
Cross-Organizational Leadership: Track record as a trusted advisor across engineering organizations
Process and Cultural Change: Experience driving organization-wide process and cultural changes
Technical Communication: Strong written communication skills for design docs, one-pagers, and runbooks
Post-Mortem Facilitation: Demonstrated experience with post-mortem facilitation and incident analysis
Async Collaboration: Experience with asynchronous collaboration across multiple time zones

Skills

Required skills

Site Reliability Engineering: 10+ years of hands-on SRE experience with focus on proactive reliability improvements
Cloud Platforms: Deep expertise in at least one major cloud platform (AWS, GCP, or Azure)
Incident Management: Expert-level experience with incident management tools like Rootly, PagerDuty
Distributed Systems: Strong understanding of distributed systems architecture and failure modes at scale
Observability: Deep experience with metrics, logging, tracing, and complex issue diagnosis
Kubernetes: Container orchestration experience for multi-cloud deployments
Systems Thinking: Understanding of how infrastructure design affects failure modes and recovery
Leadership: Track record as trusted advisor driving organization-wide process improvements
Technical Communication: Strong written communication for design docs, runbooks, and incident documentation

Nice to have

Apache Kafka: Experience with Kafka or event streaming platforms, or demonstrated rapid mastery of complex systems
Multi-Cloud Experience: Experience with 2+ cloud platforms (AWS, GCP, Azure) for enhanced reliability
AI-Assisted Workflows: Experience with modern CI/CD, GitHub, and AI-assisted workflows for automation
Large-Scale Organizations: Experience in organizations with 500+ engineers and complex reliability programs
Post-Mortem Facilitation: Advanced skills in facilitating effective post-mortems and developing actionable corrective actions

Compensation & benefits

Benefits

Equal Opportunity Workplace: Employment decisions based on job-related criteria without regard to protected classifications
Inclusive Culture: Belonging-focused workplace with emphasis on diverse perspectives and leadership opportunities
Global Team Collaboration: Follow-the-sun coverage with clean handoffs maintaining sustainable work hours
Professional Growth: Opportunities to lead cross-team initiatives and drive organization-wide improvements
Remote Work Flexibility: Remote work opportunity in India with global team collaboration


Interview process

  1. Initial Screening: Phone or video screening with recruiter covering background, experience, and role expectations
  2. Technical Deep Dive: Technical interview focusing on SRE experience, incident management, and distributed systems knowledge
  3. System Design: System design interview covering reliability architecture, failure modes, and scalability patterns
  4. Behavioral Interview: Leadership and collaboration assessment focusing on cross-team influence and process improvement experience
  5. Final Round: Panel interview with engineering leadership covering strategic thinking and cultural fit


Confluent

Confluent is an American data streaming company whose platform is built on Apache Kafka.

Mountain View, California, United States
Founded 2014
confluent.io

Tech Stack

Languages: Python, Go, Java
Frameworks: Apache Kafka, Kubernetes
Databases: Apache Kafka, ClickHouse, PostgreSQL
Tools: Rootly, PagerDuty, Grafana, Prometheus, Terraform, GitHub, Jira, Confluence, Slack
Other: AWS, GCP, Azure, OpenTelemetry, Docker
