LiteLLM

Founding Reliability & Performance Engineer

Location

San Francisco

Type

Full Time

Salary

USD 200,000 – 270,000

Level

Senior

Role

Site Reliability Engineer

Posted

Mar 4, 2026

The role

Summary

LiteLLM is seeking a Founding Reliability & Performance Engineer as its first dedicated reliability hire, responsible for the stability and performance of its AI infrastructure platform. The ideal candidate will own production reliability, performance engineering, and observability for an open-source project used by organizations such as NASA, Netflix, and Adobe.

What you'll do

Production Reliability: Manage on-call rotations, handle incident response, conduct blameless post-mortems, and support enterprise customer escalations
Performance Optimization: Detect and prevent memory leaks, optimize hot paths, establish performance benchmarks, and ensure system performance under high load
Observability: Implement structured logging and distributed tracing, develop accurate Prometheus metrics, and define and track SLOs for enterprise customers
Release Management: Build canary deployment strategies, implement automated rollback mechanisms, and ensure release safety

What we look for

Technical

Python Production Experience: At least 2 years running Python services in production, with experience debugging at scale
Async Python: Deep understanding of the asyncio event loop, session management, and connection pooling
Database Performance: Expertise in PostgreSQL connection pool tuning and query optimization

Education

Computer Science: Bachelor's degree in Computer Science, Software Engineering, or a related field preferred

Experience

On-Call Experience: Previous experience with production on-call rotations
Scalability: Demonstrated ability to handle performance challenges in distributed systems

Skills

Required skills

Python: Extensive experience with Python production services, especially async Python internals
Kubernetes: Operational-level knowledge of Kubernetes, pod lifecycle, and resource management
PostgreSQL: Advanced understanding of database performance, connection pool tuning, and query optimization
Performance Engineering: Expertise in identifying and resolving performance bottlenecks, memory leaks, and latency issues

Nice to have

Infrastructure Experience: Background in production engineering at Meta, Cloudflare, Fastly, or similar infrastructure companies
Open Source: Contributions to open-source infrastructure projects
Async Programming: Deep understanding of HTTP/2, streaming responses, and async concurrency patterns

Compensation & benefits

Salary

USD 200,000 – 270,000 (annual)

Stock options

Available

Benefits

Equity

Meaningful startup equity at a high-growth stage

Impact

Opportunity to define reliability practices for a critical AI infrastructure platform

Visibility

Contributions visible to the entire AI infrastructure community

Open Source

Work on a project with 36K GitHub stars


Interview process

  1. Initial Screening: Technical resume review and an initial phone/video call
  2. Technical Interview: Deep dive into performance engineering, Python async, and system reliability experience
  3. System Design Challenge: Evaluates the candidate's approach to solving complex reliability and performance problems
  4. On-Site/Final Interview: Meet with the engineering team to discuss potential contributions and role alignment
