Software Engineer, Cloud Infrastructure

DevOps Engineer · Senior · Full Time

London, UK · GBP 90k – 150k · 10mo ago

Role

What you'll do.

OpenAI is seeking a Senior Cloud Infrastructure Engineer to join their Applications Engineering team in London, responsible for building and maintaining the core infrastructure that powers ChatGPT and the OpenAI API. The role requires 5+ years of infrastructure experience with Kubernetes at scale and expertise in cloud platform abstractions.

Responsibilities

Infrastructure Platform Design: Design and build development and production platforms that power OpenAI products like ChatGPT and the API, ensuring reliability and security at scale
Scalability Engineering: Ensure infrastructure can scale to the next order of magnitude to support a growing user base and computational demands
Kubernetes Management: Operate and maintain Kubernetes clusters at scale, managing container orchestration and microservices architecture
Cloud Abstraction Development: Build infrastructure abstractions over cloud platforms to enable rapid product development and deployment
System Reliability: Maintain high availability and reliability of critical infrastructure systems supporting millions of users globally
On-Call Incident Response: Participate in an on-call rotation to respond to critical incidents and ensure rapid resolution of system issues
Infrastructure Automation: Develop and maintain infrastructure deployment pipelines, monitoring systems, and automation tools
Cross-Team Collaboration: Work closely with research, engineering, product, and design teams to support AI model deployment and scaling
Security Implementation: Implement and maintain security best practices across infrastructure components and deployment processes
Performance Optimization: Monitor, analyze, and optimize system performance to ensure efficient resource utilization and cost management

Qualifications

What we look for.

Technical

Core Infrastructure Experience
5+ years of hands-on experience building and maintaining core infrastructure systems at scale
Kubernetes Expertise
Extensive experience operating Kubernetes orchestration systems at enterprise scale with high availability requirements
Cloud Platform Abstractions
Proven experience building abstractions and tooling over major cloud platforms (AWS, GCP, Azure)
Infrastructure as Code
Proficiency with Infrastructure as Code tools like Terraform, Ansible, or similar automation frameworks
Containerization
Deep understanding of Docker containerization, image management, and container security best practices
Networking and Security
Strong knowledge of network protocols, load balancing, service mesh, and infrastructure security principles
Monitoring and Observability
Experience with monitoring systems like Prometheus, Grafana, and distributed tracing for large-scale systems
CI/CD Pipeline Management
Expertise in building and maintaining continuous integration and deployment pipelines for infrastructure and applications

Education

Bachelor's Degree
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience in infrastructure engineering
Advanced Certifications
Professional cloud certifications (AWS Solutions Architect, Google Cloud Professional, Azure Solutions Architect) preferred but not required

Experience

Scalable Systems Experience
Demonstrated experience building and operating scalable, reliable, secure systems in production environments
High-Growth Environment
Comfortable working in ambiguous, rapidly changing environments with evolving technical requirements
Team Leadership
Experience mentoring junior engineers and contributing to technical decision-making processes
Incident Management
Proven track record in incident response, post-mortem analysis, and implementing preventive measures

Skills

Required

Kubernetes Administration
Expert-level skills in Kubernetes cluster management, including deployment, scaling, monitoring, and troubleshooting
Cloud Infrastructure
Advanced knowledge of cloud platforms (AWS/GCP/Azure) including compute, storage, networking, and managed services
Infrastructure as Code
Proficiency with Terraform, Ansible, or similar tools for automated infrastructure provisioning and configuration management
System Architecture
Strong understanding of distributed systems architecture, microservices patterns, and scalability principles
Programming Skills
Solid programming experience in Python, Go, or similar languages for automation, tooling, and infrastructure development
Linux/Unix Administration
Advanced system administration skills including shell scripting, process management, and performance tuning
Monitoring and Alerting
Experience with monitoring tools like Prometheus, Grafana, ELK stack, and implementing effective alerting strategies
Security Best Practices
Knowledge of infrastructure security, secrets management, network security, and compliance requirements

Preferred

Service Mesh Technologies
Nice to have
Experience with Istio, Linkerd, or similar service mesh technologies for microservices communication
GitOps Workflows
Nice to have
Familiarity with GitOps practices using ArgoCD, Flux, or similar tools for declarative infrastructure management
Multi-Cloud Architecture
Nice to have
Experience designing and implementing multi-cloud or hybrid cloud infrastructure solutions
AI/ML Infrastructure
Nice to have
Understanding of machine learning infrastructure requirements, GPU computing, and model serving platforms
Cost Optimization
Nice to have
Experience with cloud cost optimization strategies, resource right-sizing, and financial operations (FinOps)
Open Source Contributions
Nice to have
Active contributions to open source infrastructure projects or cloud-native ecosystem tools

Tech stack

Languages

Python, Go, JavaScript/TypeScript, Bash/Shell

Frameworks

Kubernetes, Helm, Terraform, Ansible

Databases

PostgreSQL, Redis, InfluxDB, Elasticsearch

Tools

Docker, Jenkins, Prometheus, Grafana, GitLab CI/CD, Vault, Istio

Other

AWS/GCP/Azure, Linux/Unix, Networking, Security, Observability

Compensation

Pay and benefits.

Base · GBP 90,000 – 150,000

Equity · Stock options

Benefits

Equity Participation
Meaningful equity stake in one of the world's leading AI companies with significant growth potential
Health Insurance
Comprehensive medical, dental, and vision coverage for employees and dependents through top UK providers
Flexible Time Off
Unlimited PTO policy allowing for work-life balance and personal time management
Professional Development
Annual learning and development budget for conferences, courses, certifications, and skill advancement
Remote Work Support
Home office setup stipend and flexible hybrid work arrangements with modern London office space
Parental Leave
Extended parental leave policies exceeding UK statutory requirements for new parents
Pension Scheme
Competitive pension contribution matching to support long-term financial planning
Mental Health Support
Employee assistance programs and mental health resources including counseling services
Technology Allowance
Latest MacBook Pro, additional monitors, and choice of productivity tools and software
Team Events
Regular team building activities, company retreats, and social events to foster collaboration

Process

Interview steps.

01
Initial Screen
30-minute phone/video call with talent acquisition to discuss background, role fit, and answer initial questions about OpenAI
02
Technical Phone Screen
45-minute technical discussion with a senior engineer covering infrastructure concepts, system design, and problem-solving approach
03
System Design Interview
60-minute system design session focusing on large-scale infrastructure architecture, scalability patterns, and cloud platform design
04
Technical Deep Dive
90-minute hands-on technical interview covering Kubernetes, cloud infrastructure, monitoring, and real-world problem scenarios
05
Behavioral Interview
45-minute discussion with hiring manager about leadership experience, cultural fit, handling ambiguity, and alignment with OpenAI values
06
Final Round
30-minute conversation with senior leadership to discuss career goals, team dynamics, and mutual fit for the role
07
Reference Checks
Professional reference verification with previous managers and colleagues to validate technical skills and work style

Full posting

Original listing.

About the Team

The Applications Engineering team works across research, engineering, product, and design to bring OpenAI’s technology to consumers and businesses.

You’ll join the team responsible for running the core infrastructure that supports products like ChatGPT and the API. The systems we support include our Kubernetes clusters, infrastructure deployment, our networking stack, cloud abstractions, and more.

We seek to learn from deployment and distribute the benefits of AI, while ensuring that this powerful tool is used responsibly and safely. Safety is more important to us than unfettered growth.

About the Role

The cloud infrastructure team builds and maintains infrastructure abstractions allowing OpenAI to ship products quickly and scalably.

In this role, you will:

Design and build the development and production platforms that power our products, enabling reliability and security at scale
Ensure our infrastructure can scale to the next order of magnitude
Help create a diverse, equitable, and inclusive culture that makes all feel welcome while enabling radical candor and the challenging of groupthink
Like all other teams, we are responsible for the reliability of the systems we build. This includes an on-call rotation to respond to critical incidents as needed.

You might thrive in this role if you:

Have 5+ years building core infrastructure
Have experience operating orchestration systems such as Kubernetes at scale
Have experience building abstractions over cloud platforms
Take pride in building and operating scalable, reliable, secure systems
Are comfortable with ambiguity and rapid change

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

For additional information, please see OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement.

Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, for US-based candidates. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
