Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

Confluent (posted 3 months ago)

Location: Remote, Ontario, Canada

Workplace: Remote

Type: Full Time

Salary: CAD 225,100 – 264,500

Level: Staff

Role: Site Reliability Engineer

Posted: Jan 23, 2026

The role

Summary

A Staff Site Reliability Engineer role at Confluent focused on incident management and reliability for its multi-cloud data streaming platform. The position combines 75% hands-on engineering work with 25% coaching and strategic program ownership, and requires 10+ years of experience in large-scale distributed systems and incident management tooling.

What you'll do

Systemic Failure Analysis: Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence

Tooling Ownership: Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack

SLO/SLA Management: Define and maintain SLO/SLA frameworks and use error budgets to guide reliability investments

Incident Response Standards: Own standards, practices, and continuous improvement of incident response across engineering

Customer Communication: Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity

Training & Coaching: Develop and deliver training programs and coach teams through post-mortems

Strategic Partnership: Partner with engineering leaders to elevate reliability practices organization-wide

Proactive Reliability Engineering: Build automation, improve tooling, and design reliability improvements to prevent incidents

What we look for

Technical

SRE Experience: 10+ years of relevant experience in SRE, incident management, or reliability engineering

Multi-Cloud Expertise: Cloud experience with at least one of AWS, GCP, or Azure (Confluent runs all three)

Large Scale Organization: Experience navigating reliability/incident programs at 500+ engineer organizations

Incident Management Tools: Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)

Distributed Systems: Strong understanding of distributed systems and failure modes at scale

Observability Stack: Deep experience with observability: metrics, logging, tracing

Container Orchestration: Kubernetes and container orchestration experience

CI/CD Systems: Understanding of CI/CD pipelines and release processes

Education

Technical Background: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience

Continuous Learning: Demonstrated ability to rapidly master complex systems and technologies

Experience

Staff-Level Engineering: Minimum 10 years of experience with demonstrated expertise in site reliability engineering

Enterprise Scale: Experience with high-scale systems processing millions of events per second

Organizational Change: Experience driving org-wide process and cultural changes

Multi-Cloud Operations: Hands-on experience operating services across multiple cloud providers

Skills

Required skills

Site Reliability Engineering: Expert-level SRE practices and methodologies

Incident Management: Deep expertise in incident response and management processes

Multi-Cloud Architecture: Experience with AWS, GCP, and Azure cloud platforms

Distributed Systems: Understanding of large-scale distributed system failure modes

Observability: Metrics, logging, and tracing implementation and analysis

Kubernetes: Container orchestration and cloud-native technologies

Technical Communication: Strong written communication for design docs, runbooks, and post-mortems

Leadership: Ability to drive organizational process and cultural changes

Nice to have

Apache Kafka: Event streaming expertise or demonstrated ability to rapidly master complex systems

Rootly: Specific experience with the Rootly incident management platform

Data Streaming: Understanding of real-time data processing and streaming architectures

Teaching/Coaching: Experience mentoring and training engineering teams

Compensation & benefits

Salary: CAD 225,100 – 264,500 (annual)

Stock options: Available

Benefits

Equity Compensation: Stock options and equity participation in company growth

Remote Work: Fully remote position with flexible work arrangements

Global Team Collaboration: Work with a global team using a follow-the-sun coverage model

Professional Development: Opportunities to coach and train other engineers

Work-Life Balance: Sustainable hours with clean handoffs between global team members

Equal Opportunity: Inclusive workplace focused on diversity and belonging

Interview process

1. Initial Screening: Phone or video call with the recruiting team to discuss background and role fit
2. Technical Assessment: In-depth technical discussion covering SRE practices, incident management, and system design
3. Behavioral Interview: Leadership and communication assessment focusing on coaching and organizational change experience
4. System Design Interview: Deep dive into distributed systems, reliability patterns, and multi-cloud architecture
5. Team Interviews: Meetings with current SRE team members and engineering leadership
6. Final Interview: Senior leadership discussion covering strategic vision and cultural fit