Software Engineer, Infrastructure Reliability

OpenAI · 1 month ago

Location

San Francisco

Type

Full Time

Salary

USD 255,000 – 385,000

Level

Senior

Role

Infrastructure Engineer

Posted

Feb 3, 2026

The role

Summary

OpenAI is seeking a Software Engineer for Infrastructure Reliability to join its Applied Infrastructure organization, specifically the Database Systems and Online Storage teams. The role involves scaling and hardening the infrastructure that powers AI systems such as ChatGPT and the OpenAI API, ensuring high reliability, observability, performance, and security for millions of users worldwide.

What you'll do

Infrastructure Design and Operation: Design, build, and operate reliable and performant systems used across engineering teams

Performance Optimization: Identify and fix performance bottlenecks and inefficiencies to ensure infrastructure can scale to the next order of magnitude

Complex Issue Resolution: Dig deep to resolve complex technical issues across distributed systems

Automation Development: Continuously improve automation to reduce manual work and enhance developer experience

Incident Response Management: Contribute to incident response, postmortems, and development of best practices around system reliability

Cross-functional Collaboration: Work closely with infrastructure, product, and research teams to turn complex infrastructure into reliable platforms

Technical Direction: Play a key part in shaping technical direction and proactively improving system resilience

Scalability Engineering: Ensure systems can support cutting-edge AI research and deploy at global scale for millions of users

What we look for

Technical

Distributed Systems Expertise: Deep understanding of distributed systems principles with proven track record in building scalable systems

Kubernetes Experience: Experience operating Kubernetes at scale and building abstractions over cloud platforms

Cloud Infrastructure Proficiency: Strong proficiency in AWS, GCP, or Azure cloud platforms

Infrastructure as Code: Experience with IaC tools such as Terraform for infrastructure management

Containerization Technologies: Experience with Docker and container orchestration platforms

Observability Tools: Experience with Datadog, Prometheus, Grafana, Splunk, and ELK stack

Microservices Architecture: Experience with microservices architecture and service mesh technologies

Linux Systems: Comfort working in Linux environments and system administration

Security Best Practices: Knowledge of security best practices in cloud environments

Education

Bachelor's Degree: Bachelor's degree in Computer Science, Engineering, or related technical field preferred

Equivalent Experience: Equivalent professional experience in infrastructure engineering accepted in lieu of formal degree

Experience

Industry Experience: 4+ years of relevant industry experience in infrastructure or reliability engineering

Leadership Experience: 2+ years leading large-scale, complex projects or teams as an engineer or tech lead

Production Engineering: Proven experience as a reliability engineer, production engineer, or in a similar role at a fast-paced, scaling company

Performance Optimization: Demonstrated ability to optimize performance in complex, globally distributed systems

Skills

Required skills

Distributed Systems: Deep understanding of distributed systems principles and architecture patterns

Cloud Platforms: Proficiency in AWS, GCP, or Azure cloud infrastructure

Kubernetes: Experience with container orchestration and cluster management

Infrastructure as Code: Terraform and other IaC tools for infrastructure automation

Programming Languages: Proficiency in Python, Go, or similar languages for infrastructure tooling

Observability: Experience with monitoring, logging, and alerting systems

Linux Administration: Strong Linux systems knowledge and command-line proficiency

Nice to have

AI/ML Infrastructure: Experience with infrastructure supporting machine learning workloads

Service Mesh: Knowledge of Istio, Envoy, or similar service mesh technologies

Database Optimization: Experience optimizing database performance and scaling

Security Engineering: Background in security practices for cloud-native applications

GitOps: Experience with GitOps workflows and continuous deployment

Multi-cloud Strategy: Experience managing infrastructure across multiple cloud providers

Compensation & benefits

Salary

USD 255,000 – 385,000 (annual)

Stock options

Available

Benefits

Equity Compensation

Stock options and equity participation in OpenAI's growth

Health Insurance

Comprehensive medical, dental, and vision coverage

Professional Development

Opportunities to work with cutting-edge AI technology and continuous learning

Work-Life Balance

Flexible work arrangements and competitive time off policies

Retirement Benefits

401(k) plan with company matching contributions

Disability Accommodations

Reasonable accommodations provided for employees with disabilities

Interview process

1. Initial Screening: Phone or video call with recruiting team to discuss background and role fit
2. Technical Phone Screen: 45-60 minute technical interview covering system design and infrastructure concepts
3. System Design Interview: Deep dive into distributed systems architecture and scalability challenges
4. Technical Deep Dive: Hands-on technical discussion about past projects and infrastructure experience
5. Behavioral Interview: Assessment of cultural fit, collaboration skills, and alignment with OpenAI's mission
6. Final Round: On-site or virtual panel interviews with team members and leadership

More Jobs at OpenAI

75 other open positions

Software Engineer, Delivery / CD

San Francisco

Senior

Engineering Manager ChatGPT Infra

London, UK

Manager

Software Engineer, Ads Monetization, Revenue Platform

San Francisco

Senior

iOS Engineer, ChatGPT Mobile Infrastructure

San Francisco

Staff

Android Engineer, ChatGPT Mobile Infrastructure

San Francisco

Staff

OpenAI

OpenAI is an American artificial intelligence research organization developing advanced AI models like GPT. Focused on ensuring AI benefits humanity, it creates tools for natural language processing and generative AI applications.

San Francisco, California, United States · Founded 2015 · openai.com

Tech Stack

Languages

Python, Go, Bash/Shell, SQL

Frameworks

Kubernetes, Terraform, Service Mesh

Databases

PostgreSQL, Redis, MongoDB

Tools

Docker, Datadog, Prometheus, Grafana, CI/CD Pipelines, ELK Stack

Other

AWS, GCP, Azure, Linux
