Confluent

Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

Location

Remote, Ontario, Canada

Workplace

Remote

Type

Full Time

Salary

CAD 225,100 – 264,500

Level

Staff

Role

Site Reliability Engineer

Posted

Jan 23, 2026

The role

Summary

Staff Site Reliability Engineer role at Confluent focused on incident management and reliability for its multi-cloud data streaming platform. The position combines 75% hands-on engineering work with 25% coaching and strategic program ownership, and requires 10+ years of experience in large-scale distributed systems and incident management tooling.

What you'll do

Systemic Failure Analysis: Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Tooling Ownership: Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
SLO/SLA Management: Define and maintain SLO/SLA frameworks and use error budgets to guide reliability investments
Incident Response Standards: Own standards, practices, and continuous improvement of incident response across engineering
Customer Communication: Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Training & Coaching: Develop and deliver training programs and coach teams through post-mortems
Strategic Partnership: Partner with engineering leaders to elevate reliability practices organization-wide
Proactive Reliability Engineering: Build automation, improve tooling, and design reliability improvements to prevent incidents

What we look for

Technical

SRE Experience: 10+ years of relevant experience in SRE, incident management, or reliability engineering
Multi-Cloud Expertise: Cloud experience with at least one of AWS, GCP, or Azure (Confluent runs all three)
Large Scale Organization: Experience navigating reliability/incident programs at 500+ engineer organizations
Incident Management Tools: Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Distributed Systems: Strong understanding of distributed systems and failure modes at scale
Observability Stack: Deep experience with observability: metrics, logging, tracing
Container Orchestration: Kubernetes and container orchestration experience
CI/CD Systems: Understanding of CI/CD pipelines and release processes

Education

Technical Background: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
Continuous Learning: Demonstrated ability to rapidly master complex systems and technologies

Experience

Staff-Level Engineering: Minimum 10 years of experience with demonstrated expertise in site reliability engineering
Enterprise Scale: Experience with high-scale systems processing millions of events per second
Organizational Change: Experience driving org-wide process and cultural changes
Multi-Cloud Operations: Hands-on experience operating services across multiple cloud providers

Skills

Required skills

Site Reliability Engineering: Expert-level SRE practices and methodologies
Incident Management: Deep expertise in incident response and management processes
Multi-Cloud Architecture: Experience with AWS, GCP, and Azure cloud platforms
Distributed Systems: Understanding of large-scale distributed system failure modes
Observability: Metrics, logging, and tracing implementation and analysis
Kubernetes: Container orchestration and cloud-native technologies
Technical Communication: Strong written communication for design docs, runbooks, and post-mortems
Leadership: Ability to drive organizational process and cultural changes

Nice to have

Apache Kafka: Event streaming expertise or demonstrated ability to rapidly master complex systems
Rootly: Specific experience with Rootly incident management platform
Data Streaming: Understanding of real-time data processing and streaming architectures
Teaching/Coaching: Experience mentoring and training engineering teams

Compensation & benefits

Salary

CAD 225,100 – 264,500 (annual)

Stock options

Available

Benefits

Equity Compensation

Stock options and equity participation in company growth

Remote Work

Fully remote position with flexible work arrangements

Global Team Collaboration

Work with global team using follow-the-sun coverage model

Professional Development

Opportunities to coach and train other engineers

Work-Life Balance

Sustainable hours with clean handoffs between global team members

Equal Opportunity

Inclusive workplace focused on diversity and belonging


Interview process

  1. Initial Screening: Phone or video call with recruiting team to discuss background and role fit
  2. Technical Assessment: In-depth technical discussion covering SRE practices, incident management, and system design
  3. Behavioral Interview: Leadership and communication assessment focusing on coaching and organizational change experience
  4. System Design Interview: Deep dive into distributed systems, reliability patterns, and multi-cloud architecture
  5. Team Interviews: Meetings with current SRE team members and engineering leadership
  6. Final Interview: Senior leadership discussion covering strategic vision and cultural fit

Confluent

Confluent is an American data streaming platform company based on Apache Kafka.

Mountain View, California, United States
Founded 2014
confluent.io

Tech Stack

Languages
Python, Go, Bash/Shell
Frameworks
Apache Kafka, Kubernetes, Terraform
Databases
Apache Kafka, Time Series DBs
Tools
Rootly, PagerDuty, Prometheus, Grafana, Jaeger/Zipkin, Jira, Confluence, Slack
Other
AWS, GCP, Azure, Docker, CI/CD Pipelines
