OpenAI

Software Engineer, Reliability

Location

San Francisco

Type

Full Time

Salary

USD 230,000 – 490,000

Level

Senior

Role

Site Reliability Engineer

Posted

Oct 17, 2025


The role

Summary

OpenAI is seeking a Software Engineer, Reliability to join its Applied Engineering team in building scalable, resilient infrastructure that safely delivers AI technology to millions of users worldwide. This role focuses on designing fault-tolerant systems, implementing automation tools, and maintaining reliability standards while supporting OpenAI's rapid growth.

What you'll do

Infrastructure Scalability Design: Design and implement solutions to ensure infrastructure scales to meet rapidly increasing demands from OpenAI's growing user base
Testing Platform Development: Build and maintain load, chaos, and synthetic testing software for development teams to enhance system reliability
Automation Tool Development: Create and maintain automation tools to streamline repetitive tasks and improve overall system reliability
Resource Lifecycle Management: Build and maintain platforms for CPU/storage, GPU, and network lifecycle management to drive efficiency and dynamic optimization
Fault-Tolerant System Design: Implement fault-tolerant and resilient design patterns to minimize service disruptions and ensure high availability
SLO/SLI Management: Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability
Cross-Functional Collaboration: Partner with researchers, engineers, product managers, and designers to bring new AI features and research capabilities to market
Incident Response: Participate in on-call rotation to respond to critical incidents and ensure 24/7 system availability for global users

What we look for

Technical

Cloud Infrastructure Proficiency: Strong expertise in cloud platforms (AWS, GCP, Azure) for building scalable systems
Programming Skills: Proficiency in programming languages such as Python, Go, or similar for automation and tool development
Container Technologies: Experience with Docker containerization and Kubernetes orchestration platforms
Infrastructure as Code: Knowledge of IaC tools like Terraform or CloudFormation for automated infrastructure management
Observability Tools: Experience with monitoring tools such as Datadog, Prometheus, Grafana, and Splunk
Microservices Architecture: Understanding of microservices architecture and service mesh technologies
Cloud Security: Knowledge of security best practices in cloud environments and secure system design

Education

Bachelor's Degree: Bachelor's degree in Computer Science, Information Technology, or related field (or equivalent work experience)

Experience

Reliability Engineering Experience: Proven experience as a Software Engineer focused on reliability or similar role in fast-paced, rapidly scaling companies
Problem-Solving Skills: Excellent problem-solving and troubleshooting skills for complex distributed systems
Cross-Functional Collaboration: Experience collaborating with cross-functional teams to ensure reliability and scalability in feature development
End-to-End Ownership: Track record of owning problems end-to-end and accelerating engineering reliability through excellent tooling

Skills

Required skills

Cloud Infrastructure: Strong proficiency in major cloud platforms (AWS, GCP, Azure) for scalable system design
Programming Languages: Proficiency in languages like Python, Go, or similar for automation and infrastructure development
Container Orchestration: Experience with Docker containerization and Kubernetes orchestration platforms
Infrastructure as Code: Knowledge of IaC tools such as Terraform or CloudFormation for automated provisioning
Observability and Monitoring: Experience with tools like Datadog, Prometheus, Grafana, and Splunk for system monitoring
Problem-Solving: Excellent troubleshooting skills for complex distributed systems and performance optimization
Communication Skills: Strong written and verbal communication abilities for cross-functional collaboration

Nice to have

Microservices Architecture: Understanding of microservices design patterns and service mesh technologies
Cloud Security: Knowledge of security best practices and compliance requirements in cloud environments
Chaos Engineering: Experience with chaos engineering principles and tools for resilience testing
Load Testing: Experience with performance testing tools and methodologies for high-scale systems
SRE Principles: Deep understanding of Site Reliability Engineering practices and methodologies

Compensation & benefits

Salary

USD 230,000 – 490,000 (annual)

Stock options

Available

Benefits

Equity Compensation

Stock options and equity participation in OpenAI's growth as a leading AI company

Relocation Assistance

Comprehensive relocation support for new employees moving to San Francisco

Equal Opportunity Employment

Inclusive workplace with non-discrimination policies and equal opportunity for all qualified candidates

Reasonable Accommodations

Support for applicants with disabilities through accommodation requests and accessibility resources

Mission-Driven Work

Opportunity to work on cutting-edge AI technology that benefits humanity and shapes the future of technology


Interview process

  1. Initial Application Review: Resume and portfolio review focusing on reliability engineering experience and technical skills
  2. Phone/Video Screening: Initial conversation with recruiter or hiring manager about background, motivation, and role fit
  3. Technical Phone Interview: Technical discussion covering system design, reliability principles, and problem-solving approach
  4. System Design Interview: In-depth system design session focusing on scalability, reliability, and infrastructure architecture
  5. Technical Deep Dive: Hands-on technical interview covering specific technologies, tools, and real-world scenarios
  6. Team Collaboration Interview: Behavioral interview assessing cross-functional collaboration skills and cultural fit
  7. Final Interview Round: Final interviews with senior team members and stakeholders, including discussion of OpenAI's mission alignment
