Staff Site Reliability Engineer

Replit, 1 month ago

Location

Remote - Europe

Workplace

Remote

Type

Full Time

Salary

USD 180,000 – 250,000

Level

Staff

Role

Site Reliability Engineer

Posted

May 20, 2026

The role

Summary

Replit is seeking a Staff Site Reliability Engineer to improve the reliability and scalability of its global software development platform. The ideal candidate will proactively improve infrastructure, design observability solutions, and drive automation across Replit's cloud-native environment, supporting millions of developers worldwide.

What you'll do

Observability Architecture: Design and implement comprehensive monitoring, logging, and tracing solutions with real-time system health dashboards

Reliability Standards: Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across engineering teams

Incident Management: Lead high-impact incident responses, conduct blameless post-mortems, and develop preventative automation strategies

Infrastructure Automation: Architect CI/CD pipelines, infrastructure as code, and self-healing systems to eliminate operational toil

Performance Optimization: Tune the performance of large-scale Kubernetes deployments, reduce latency, and implement strategic capacity planning

Systems Mentorship: Educate and mentor engineering teams to embed reliability as a core organizational value

What we look for

Technical

Programming Languages: Advanced programming skills in Python or Go, with the ability to write high-quality, well-tested code

Cloud Infrastructure: Expert-level experience with Kubernetes, container orchestration, and cloud-native technologies

Monitoring Tools: Proficiency in observability platforms such as Prometheus, Grafana, and OpenTelemetry

Education

Degree Preference: Bachelor's degree in Computer Science, Software Engineering, or a related technical field preferred

Experience

Professional Experience: 8-10 years in Site Reliability Engineering, DevOps, Systems Engineering, or Infrastructure Engineering

Incident Response: Proven track record of managing complex system incidents with critical thinking under pressure

Distributed Systems: Demonstrated experience designing, building, and maintaining large-scale production services

Skills

Required skills

Kubernetes: Deep understanding of container orchestration and cloud-native deployment strategies

Infrastructure as Code: Expertise with Terraform, Pulumi, or similar infrastructure-as-code tools

Distributed Systems: Ability to design and optimize complex, service-oriented architectures

Nice to have

Google Cloud Platform: Advanced knowledge of GCP services and ecosystem

Go Programming: Expert-level Go language skills for systems development

Startup Experience: Familiarity with rapid-growth technology environments

Compensation & benefits

Salary

USD 180,000 – 250,000 (annual)

Stock options

Available

Benefits

Health Insurance

Comprehensive health, dental, vision, and life insurance coverage

Retirement Planning

401(k) program with 4% company match for US employees

Leave Policies

Flexible paid parental, medical, and caregiver leave

Time Off

Flexible Time Off (FTO) policy with additional holidays

Wellness Benefits

Monthly wellness stipend and autonomous work environment

Interview process

1. Initial Screening: HR recruiter phone screen to assess background and initial fit
2. Technical Assessment: Comprehensive technical interview focusing on SRE skills and system design
3. System Design Challenge: In-depth evaluation of the candidate's approach to building reliable, scalable infrastructure
4. Team Interviews: Multiple rounds with SRE team members to assess technical and cultural alignment
5. Final Leadership Interview: Meeting with engineering leadership to discuss vision and potential impact