Replit

Staff Site Reliability Engineer

Replit · 1 month ago
Location: Remote - Europe
Workplace: Remote
Type: Full Time
Salary: USD 180,000 – 250,000
Level: Staff
Role: Site Reliability Engineer
Posted: May 20, 2026


The role

Summary

Replit is seeking a Staff Site Reliability Engineer to improve the reliability and scalability of its global software development platform. The ideal candidate will proactively improve infrastructure, design observability solutions, and drive automation across Replit's cloud-native environment, supporting millions of developers worldwide.

What you'll do

Observability Architecture: Design and implement comprehensive monitoring, logging, and tracing solutions with real-time system health dashboards
Reliability Standards: Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across engineering teams
Incident Management: Lead high-impact incident responses, conduct blameless post-mortems, and develop preventative automation strategies
Infrastructure Automation: Architect CI/CD pipelines, infrastructure as code, and self-healing systems to eliminate operational toil
Performance Optimization: Tune large-scale Kubernetes deployments, reduce latency, and implement strategic capacity planning
Systems Mentorship: Educate and mentor engineering teams to embed reliability as a core organizational value

What we look for

Technical

Programming Languages: Advanced programming skills in Python or Go, with the ability to write high-quality, well-tested code
Cloud Infrastructure: Expert-level experience with Kubernetes, container orchestration, and cloud-native technologies
Monitoring Tools: Proficiency in observability platforms such as Prometheus, Grafana, and OpenTelemetry

Education

Degree Preference: Bachelor's degree in Computer Science, Software Engineering, or related technical field preferred

Experience

Professional Experience: 8-10 years in Site Reliability Engineering, DevOps, Systems Engineering, or Infrastructure Engineering
Incident Response: Proven track record of managing complex system incidents with critical thinking under pressure
Distributed Systems: Demonstrated experience designing, building, and maintaining large-scale production services

Skills

Required skills

Kubernetes: Deep understanding of container orchestration and cloud-native deployment strategies
Infrastructure as Code: Expertise with Terraform, Pulumi, or similar configuration management tools
Distributed Systems: Ability to design and optimize complex, service-oriented architectures

Nice to have

Google Cloud Platform: Advanced knowledge of GCP services and ecosystem
Go Programming: Expert-level Go language skills for systems development
Startup Experience: Familiarity with rapid-growth technology environments

Compensation & benefits

Salary

USD 180,000 – 250,000 (annual)

Stock options

Available

Benefits

Health Insurance

Comprehensive health, dental, vision, and life insurance coverage

Retirement Planning

401(k) program with 4% company match for US employees

Leave Policies

Flexible paid parental, medical, and caregiver leave

Time Off

Flexible Time Off (FTO) policy with additional holidays

Wellness Benefits

Monthly wellness stipend and autonomous work environment


Interview process

  1. Initial Screening: HR recruiter phone screen to assess background and initial fit
  2. Technical Assessment: Comprehensive technical interview focusing on SRE skills and system design
  3. System Design Challenge: In-depth evaluation of candidate's approach to building reliable, scalable infrastructure
  4. Team Interviews: Multiple rounds with SRE team members to assess technical and cultural alignment
  5. Final Leadership Interview: Meeting with engineering leadership to discuss vision and potential impact
