OpenAI

Software Engineer, Infrastructure Reliability

OpenAI · 1 month ago
Location

San Francisco

Type

Full Time

Salary

USD 255,000 – 385,000

Level

Senior

Role

Infrastructure Engineer

Posted

Feb 3, 2026

The role

Summary

OpenAI is seeking a Software Engineer for Infrastructure Reliability to join their Applied Infrastructure organization, specifically the Database Systems and Online Storage teams. This role involves scaling and hardening the infrastructure that powers AI systems like ChatGPT and the OpenAI API, ensuring high reliability, observability, performance, and security for millions of global users.

What you'll do

Infrastructure Design and Operation: Design, build, and operate reliable and performant systems used across engineering teams
Performance Optimization: Identify and fix performance bottlenecks and inefficiencies to ensure infrastructure can scale to the next order of magnitude
Complex Issue Resolution: Dig deep to resolve complex technical issues across distributed systems
Automation Development: Continuously improve automation to reduce manual work and enhance developer experience
Incident Response Management: Contribute to incident response, postmortems, and development of best practices around system reliability
Cross-functional Collaboration: Work closely with infrastructure, product, and research teams to turn complex infrastructure into reliable platforms
Technical Direction: Play a key part in shaping technical direction and proactively improving system resilience
Scalability Engineering: Ensure systems can support cutting-edge AI research and deploy at global scale for millions of users

What we look for

Technical

Distributed Systems Expertise: Deep understanding of distributed systems principles with proven track record in building scalable systems
Kubernetes Experience: Experience operating Kubernetes at scale and building abstractions over cloud platforms
Cloud Infrastructure Proficiency: Strong proficiency in AWS, GCP, or Azure cloud platforms
Infrastructure as Code: Experience with IaC tools such as Terraform for infrastructure management
Containerization Technologies: Experience with Docker and container orchestration platforms
Observability Tools: Experience with Datadog, Prometheus, Grafana, Splunk, and ELK stack
Microservices Architecture: Experience with microservices architecture and service mesh technologies
Linux Systems: Comfort working in Linux environments and system administration
Security Best Practices: Knowledge of security best practices in cloud environments

Education

Bachelor's Degree: Bachelor's degree in Computer Science, Engineering, or related technical field preferred
Equivalent Experience: Equivalent professional experience in infrastructure engineering accepted in lieu of formal degree

Experience

Industry Experience: 4+ years of relevant industry experience in infrastructure or reliability engineering
Leadership Experience: 2+ years leading large-scale, complex projects or teams as an engineer or tech lead
Production Engineering: Proven experience as a reliability engineer, production engineer, or in a similar role at a fast-paced, scaling company
Performance Optimization: Demonstrated ability to optimize performance in complex, globally distributed systems

Skills

Required skills

Distributed Systems: Deep understanding of distributed systems principles and architecture patterns
Cloud Platforms: Proficiency in AWS, GCP, or Azure cloud infrastructure
Kubernetes: Experience with container orchestration and cluster management
Infrastructure as Code: Terraform and other IaC tools for infrastructure automation
Programming Languages: Proficiency in Python, Go, or similar languages for infrastructure tooling
Observability: Experience with monitoring, logging, and alerting systems
Linux Administration: Strong Linux systems knowledge and command-line proficiency

Nice to have

AI/ML Infrastructure: Experience with infrastructure supporting machine learning workloads
Service Mesh: Knowledge of Istio, Envoy, or similar service mesh technologies
Database Optimization: Experience optimizing database performance and scaling
Security Engineering: Background in security practices for cloud-native applications
GitOps: Experience with GitOps workflows and continuous deployment
Multi-cloud Strategy: Experience managing infrastructure across multiple cloud providers

Compensation & benefits

Salary

USD 255,000 – 385,000 (annual)

Stock options

Available

Benefits

Equity Compensation

Stock options and equity participation in OpenAI's growth

Health Insurance

Comprehensive medical, dental, and vision coverage

Professional Development

Opportunities for continuous learning and for working with cutting-edge AI technology

Work-Life Balance

Flexible work arrangements and competitive time off policies

Retirement Benefits

401(k) plan with company matching contributions

Disability Accommodations

Reasonable accommodations provided for employees with disabilities


Interview process

  1. Initial Screening: Phone or video call with recruiting team to discuss background and role fit
  2. Technical Phone Screen: 45-60 minute technical interview covering system design and infrastructure concepts
  3. System Design Interview: Deep dive into distributed systems architecture and scalability challenges
  4. Technical Deep Dive: Hands-on technical discussion about past projects and infrastructure experience
  5. Behavioral Interview: Assessment of cultural fit, collaboration skills, and alignment with OpenAI's mission
  6. Final Round: On-site or virtual panel interviews with team members and leadership
