Replit

Staff Site Reliability Engineer

Replit · 6 months ago
Location

Remote (United States)

Type

Full Time

Salary

USD 220,000 – 325,000

Level

Staff

Role

Site Reliability Engineer

Posted

Oct 27, 2025

The role

Summary

Replit is seeking a Staff Site Reliability Engineer to lead infrastructure reliability and scalability for its platform, which serves millions of developers worldwide. This senior role involves architecting observability solutions, leading incident response, and mentoring engineering teams to embed reliability as a core value. The position requires 8-10 years of SRE experience with deep expertise in Kubernetes, distributed systems, and modern observability platforms.

What you'll do

Architect Observability Solutions: Design, build, and implement comprehensive monitoring, logging, and tracing solutions with real-time dashboards and metrics
Define Reliability Standards: Establish and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across engineering teams
Lead Incident Management: Guide high-impact incident response, conduct blameless post-mortems, and implement preventative measures
Drive Infrastructure Automation: Build CI/CD pipelines, infrastructure as code, and self-healing systems to eliminate operational toil
Optimize Kubernetes Performance: Tune the performance of large-scale cloud deployments, with a focus on Kubernetes, Docker, and GCP
Debug Distributed Systems: Resolve complex technical problems across the stack and implement long-term architectural improvements
Provide Technical Leadership: Review system designs for reliability, scalability, security, and operational integrity across the company
Mentor Engineering Teams: Educate and guide engineers to embed reliability as a core engineering culture value at Replit

What we look for

Technical

Site Reliability Engineering: 8-10 years of experience in SRE, DevOps, Systems Engineering, or Infrastructure Engineering roles
Programming Proficiency: Strong coding skills in Python or Go with ability to write high-quality, well-tested production code
Distributed Systems: Deep understanding of designing, building, scaling, and maintaining production services in service-oriented architectures
Kubernetes Expertise: Extensive experience with container orchestration platforms, specifically Kubernetes and cloud-native technologies
Observability Systems: Proven track record of designing and implementing sophisticated monitoring, logging, and tracing solutions
Incident Management: Strong incident response leadership experience for complex systems with demonstrated critical thinking under pressure
Infrastructure as Code: Experience with tools like Terraform, Pulumi, and configuration management systems

Experience

Senior Technical Leadership: Experience working with and mentoring engineers from junior to principal levels across technical teams
Stack Debugging: Willingness and ability to understand, debug, and improve any layer of the technology stack
Communication Skills: Excellent written and verbal communication with ability to explain complex technical concepts clearly

Skills

Required skills

Python/Go Programming: Strong programming skills in Python or Go for building production systems and internal tools
Kubernetes: Deep experience with container orchestration platforms, specifically Kubernetes and cloud-native technologies
Distributed Systems: Expertise in designing, building, and maintaining large-scale distributed systems and service-oriented architectures
Observability: Proven experience designing and implementing comprehensive monitoring, logging, and tracing solutions
Incident Management: Strong incident response leadership skills with experience managing complex system outages
Infrastructure as Code: Experience with Terraform, Pulumi, and configuration management tools

Nice to have

Google Cloud Platform: Deep experience with GCP services and cloud-native tools for large-scale deployments
Modern Observability Platforms: Expert-level knowledge of Prometheus, Grafana, Datadog, and OpenTelemetry
High-Performance Systems: Experience designing systems capable of handling high throughput and low latency requirements
Startup Environment: Familiarity with rapid-growth startup environments and scaling challenges
Technical Writing: Experience creating company-facing blog posts and training materials

Compensation & benefits

Salary

USD 220,000 – 325,000 (annual)

Stock options

Available

Benefits

Competitive Salary & Equity

Market-competitive compensation package with equity participation

401(k) with 4% Match

Retirement savings plan with company matching contribution up to 4%

Health Insurance

Comprehensive health, dental, vision, and life insurance coverage

Disability Coverage

Short-term and long-term disability insurance protection

Parental Leave

Paid parental, medical, and caregiver leave for family needs

Commuter Benefits

Transportation and commuting expense reimbursement

Wellness Stipend

Monthly allowance for health and wellness activities

Work From Home Setup

Home office setup reimbursement for remote work equipment

Flexible Time Off

Unlimited PTO policy with company holidays

Team Gatherings

Quarterly team building events and company gatherings


Replit

Replit is a platform that allows developers to code in the browser.

San Francisco, California, United States
Founded 2015
replit.com

Tech Stack

Languages
Python, Go
Frameworks
OpenTelemetry
Tools
Kubernetes, Docker, Terraform, Pulumi, Prometheus, Grafana, Datadog
Other
Google Cloud Platform, CI/CD Pipelines