OpenAI

Software Engineer, Caching Infrastructure

Location

San Francisco

Type

Full Time

Salary

USD 230,000 – 385,000

Level

Senior

Role

Backend Engineer

Posted

Jul 18, 2025

The role

Summary

OpenAI is seeking a Senior Software Engineer to design and scale the multi-tenant caching infrastructure that powers ChatGPT, the API, and other AI products. The role involves building highly available distributed caching systems on Redis/Memcached and Kubernetes to support inference, identity, and product experiences across OpenAI's platform.

What you'll do

Platform Architecture: Design, build, and operate OpenAI's multi-tenant caching platform used across inference, identity, quota, and product experiences
Strategic Planning: Define the long-term vision and roadmap for caching as a core infrastructure capability, balancing performance, durability, and cost
Cross-Team Collaboration: Partner with infrastructure teams (networking, observability, databases) and product teams to ensure the caching platform meets their needs
Performance Optimization: Optimize cache performance, minimize tail latency, and ensure high availability across diverse use cases
Scalability Engineering: Build autoscaling systems that dynamically adjust to workload demands while maintaining cost efficiency
System Monitoring: Implement comprehensive observability, monitoring, and alerting for distributed caching infrastructure
Capacity Planning: Analyze usage patterns and plan infrastructure capacity to support OpenAI's growing AI model deployment needs
Incident Response: Participate in on-call rotation and lead incident response for caching infrastructure issues
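The platform-architecture and performance responsibilities above revolve around read-through caching. As a rough illustration only (the `FakeCache`, `get_user`, and `load_from_db` names are hypothetical, and a real deployment would use a Redis client such as redis-py against a cluster endpoint), a minimal cache-aside lookup with a TTL might look like:

```python
import time

# In-memory stand-in for a Redis client; mimics redis-py's get/setex.
class FakeCache:
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily expire stale entries
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

def get_user(cache, user_id, load_from_db):
    """Cache-aside read: try the cache, fall back to the source of truth, repopulate."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)  # cache miss: hit the database
    cache.setex(key, 300, value)   # repopulate with a 5-minute TTL
    return value
```

The TTL bounds staleness; in a multi-tenant platform, key prefixes like `user:` are one common way to partition tenants within a shared keyspace.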

What we look for

Technical

Distributed Systems: 5+ years of experience building and scaling distributed systems, with a focus on caching, load balancing, or storage
Redis/Memcached Expertise: Deep expertise with Redis and Memcached, including clustering, durability configurations, client-side connection patterns, and performance tuning
Kubernetes Production: Production experience with Kubernetes, service meshes (Envoy), and autoscaling systems
Performance Engineering: Rigorous thinking about latency, reliability, throughput, and cost in platform design
Network Protocols: Strong understanding of networking fundamentals, TCP/IP, load balancing, and service discovery
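Clustering and load balancing, named in the requirements above, typically rest on consistent hashing: keys map to the first node clockwise from their hash on a ring, so adding or removing a node remaps only a small slice of the keyspace. A minimal sketch (the `HashRing` name and the use of MD5 are illustrative choices, not anything specified in this posting):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes for smoother key distribution."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(s):
        # MD5 used only for a stable, well-spread hash; not for security.
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]
```

Virtual nodes (`vnodes`) spread each physical node across many ring positions, which evens out load when the node count is small.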

Education

Computer Science Degree: Bachelor's or Master's degree in Computer Science, Engineering, or equivalent practical experience

Experience

Senior Engineering: 5+ years in senior software engineering roles with increasing responsibility
Infrastructure Scale: Experience building infrastructure serving millions of users or high-throughput AI/ML workloads
Fast-Paced Environment: Ability to thrive in a fast-paced environment, balancing pragmatic engineering with long-term technical excellence

Skills

Required skills

Redis/Memcached: Expert-level knowledge of distributed caching systems
Kubernetes: Production experience with container orchestration
Distributed Systems: Deep understanding of consensus algorithms, consistency models, and fault tolerance
Performance Engineering: Experience with latency optimization and throughput scaling
Service Mesh: Hands-on experience with Envoy, Istio, or similar technologies
Monitoring & Observability: Proficiency with Prometheus, Grafana, and distributed tracing
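Tail latency, which recurs throughout this listing, is usually reported as a high percentile (p99, p999) rather than a mean. One minimal way to compute it from raw samples is the nearest-rank percentile; the function name below is hypothetical, and production observability stacks typically use histogram buckets (as in Prometheus) instead of raw samples:

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile (e.g. p=0.99 for p99) over latency samples."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered)))  # nearest-rank definition
    return ordered[rank - 1]
```

Averages hide the slow requests that dominate user-perceived latency, which is why percentile-based SLOs are the norm for caching layers.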

Nice to have

Go Programming: Strong Go development skills for backend systems
Rust: Experience with Rust for performance-critical components
Cloud Platforms: Experience with AWS, GCP, or Azure infrastructure
AI/ML Infrastructure: Understanding of AI model serving and inference infrastructure
Database Systems: Knowledge of PostgreSQL, ClickHouse, or other database technologies
Network Programming: Low-level networking and protocol implementation experience

Compensation & benefits

Salary

USD 230,000 – 385,000 (annual)

Stock options

Available

Benefits

Equity Compensation

Significant equity package in one of the world's leading AI companies

Health Insurance

Comprehensive medical, dental, and vision coverage

Learning Budget

Professional development and conference attendance budget

Flexible PTO

Unlimited paid time off policy

Parental Leave

Generous parental leave for new parents

Commuter Benefits

Transportation and parking assistance

Wellness Programs

Mental health support and wellness initiatives

AI Research Access

Early access to cutting-edge AI models and research


Interview process

  1. Initial Screening: 30-minute recruiter call covering background, interest in OpenAI, and basic technical experience
  2. Technical Phone Screen: 60-minute technical interview focusing on distributed systems design and caching concepts
  3. System Design Interview: 90-minute session designing a large-scale caching infrastructure with Redis clustering and Kubernetes
  4. Code Implementation: 75-minute coding interview implementing cache algorithms, consistency patterns, or performance optimization
  5. Cross-Team Collaboration: 45-minute behavioral interview with infrastructure team focusing on collaboration and communication
  6. Leadership Discussion: 60-minute final interview with engineering leadership covering vision, technical direction, and culture fit
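The coding stage above mentions implementing cache algorithms; the canonical example of that genre is an LRU (least-recently-used) cache with O(1) get and put. A minimal sketch (this is general interview material, not a prescribed solution from the posting):

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache: OrderedDict tracks recency order, giving O(1) get/put."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```

The same recency-ordering idea underlies Redis's `allkeys-lru` eviction policy, though Redis approximates it by sampling rather than maintaining exact order.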

Apply for this position

You'll be redirected to the company's application page