OpenAI

Software Engineer, Cloud Infrastructure

Location

London, UK

Type

Full Time

Salary

GBP 90,000 – 150,000

Level

Senior

Role

DevOps Engineer

Posted

Oct 1, 2025

The role

Summary

OpenAI is seeking a Senior Cloud Infrastructure Engineer to join its Applications Engineering team in London, responsible for building and maintaining the core infrastructure that powers ChatGPT and the OpenAI API. The role requires 5+ years of infrastructure experience, including operating Kubernetes at scale and building abstractions over cloud platforms.

What you'll do

Infrastructure Platform Design: Design and build development and production platforms that power OpenAI products like ChatGPT and the API, ensuring reliability and security at scale
Scalability Engineering: Ensure infrastructure can scale to the next order of magnitude to support a growing user base and computational demands
Kubernetes Management: Operate and maintain Kubernetes clusters at scale, managing container orchestration and microservices architecture
Cloud Abstraction Development: Build infrastructure abstractions over cloud platforms to enable rapid product development and deployment
System Reliability: Maintain high availability and reliability of critical infrastructure systems supporting millions of users globally
On-Call Incident Response: Participate in on-call rotation to respond to critical incidents and ensure rapid resolution of system issues
Infrastructure Automation: Develop and maintain infrastructure deployment pipelines, monitoring systems, and automation tools
Cross-Team Collaboration: Work closely with research, engineering, product, and design teams to support AI model deployment and scaling
Security Implementation: Implement and maintain security best practices across infrastructure components and deployment processes
Performance Optimization: Monitor, analyze, and optimize system performance to ensure efficient resource utilization and cost management

What we look for

Technical

Core Infrastructure Experience: 5+ years of hands-on experience building and maintaining core infrastructure systems at scale
Kubernetes Expertise: Extensive experience operating Kubernetes orchestration systems at enterprise scale with high availability requirements
Cloud Platform Abstractions: Proven experience building abstractions and tooling over major cloud platforms (AWS, GCP, Azure)
Infrastructure as Code: Proficiency with Infrastructure as Code tools like Terraform, Ansible, or similar automation frameworks
Containerization: Deep understanding of Docker containerization, image management, and container security best practices
Networking and Security: Strong knowledge of network protocols, load balancing, service mesh, and infrastructure security principles
Monitoring and Observability: Experience with monitoring systems like Prometheus, Grafana, and distributed tracing for large-scale systems
CI/CD Pipeline Management: Expertise in building and maintaining continuous integration and deployment pipelines for infrastructure and applications

Education

Bachelor's Degree: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience in infrastructure engineering
Advanced Certifications: Professional cloud certifications (AWS Solutions Architect, Google Cloud Professional, Azure Solutions Architect) preferred but not required

Experience

Scalable Systems Experience: Demonstrated experience building and operating scalable, reliable, secure systems in production environments
High-Growth Environment: Comfortable working in ambiguous, rapidly changing environments with evolving technical requirements
Team Leadership: Experience mentoring junior engineers and contributing to technical decision-making processes
Incident Management: Proven track record in incident response, post-mortem analysis, and implementing preventive measures

Skills

Required skills

Kubernetes Administration: Expert-level skills in Kubernetes cluster management, including deployment, scaling, monitoring, and troubleshooting
Cloud Infrastructure: Advanced knowledge of cloud platforms (AWS/GCP/Azure) including compute, storage, networking, and managed services
Infrastructure as Code: Proficiency with Terraform, Ansible, or similar tools for automated infrastructure provisioning and configuration management
System Architecture: Strong understanding of distributed systems architecture, microservices patterns, and scalability principles
Programming Skills: Solid programming experience in Python, Go, or similar languages for automation, tooling, and infrastructure development
Linux/Unix Administration: Advanced system administration skills including shell scripting, process management, and performance tuning
Monitoring and Alerting: Experience with monitoring tools like Prometheus, Grafana, ELK stack, and implementing effective alerting strategies
Security Best Practices: Knowledge of infrastructure security, secrets management, network security, and compliance requirements

Nice to have

Service Mesh Technologies: Experience with Istio, Linkerd, or similar service mesh technologies for microservices communication
GitOps Workflows: Familiarity with GitOps practices using ArgoCD, Flux, or similar tools for declarative infrastructure management
Multi-Cloud Architecture: Experience designing and implementing multi-cloud or hybrid cloud infrastructure solutions
AI/ML Infrastructure: Understanding of machine learning infrastructure requirements, GPU computing, and model serving platforms
Cost Optimization: Experience with cloud cost optimization strategies, resource right-sizing, and financial operations (FinOps)
Open Source Contributions: Active contributions to open source infrastructure projects or cloud-native ecosystem tools

Compensation & benefits

Salary

GBP 90,000 – 150,000 (annual)

Stock options

Available

Benefits

Equity Participation

Meaningful equity stake in one of the world's leading AI companies with significant growth potential

Health Insurance

Comprehensive medical, dental, and vision coverage for employees and dependents through top UK providers

Flexible Time Off

Unlimited PTO policy allowing for work-life balance and personal time management

Professional Development

Annual learning and development budget for conferences, courses, certifications, and skill advancement

Remote Work Support

Home office setup stipend and flexible hybrid work arrangements with modern London office space

Parental Leave

Extended parental leave policies exceeding UK statutory requirements for new parents

Pension Scheme

Competitive pension contribution matching to support long-term financial planning

Mental Health Support

Employee assistance programs and mental health resources including counseling services

Technology Allowance

Latest MacBook Pro, additional monitors, and choice of productivity tools and software

Team Events

Regular team building activities, company retreats, and social events to foster collaboration


Interview process

  1. Initial Screen: 30-minute phone/video call with talent acquisition to discuss background, role fit, and answer initial questions about OpenAI
  2. Technical Phone Screen: 45-minute technical discussion with a senior engineer covering infrastructure concepts, system design, and problem-solving approach
  3. System Design Interview: 60-minute system design session focusing on large-scale infrastructure architecture, scalability patterns, and cloud platform design
  4. Technical Deep Dive: 90-minute hands-on technical interview covering Kubernetes, cloud infrastructure, monitoring, and real-world problem scenarios
  5. Behavioral Interview: 45-minute discussion with hiring manager about leadership experience, cultural fit, handling ambiguity, and alignment with OpenAI values
  6. Final Round: 30-minute conversation with senior leadership to discuss career goals, team dynamics, and mutual fit for the role
  7. Reference Checks: Professional reference verification with previous managers and colleagues to validate technical skills and work style
