Founding Reliability & Performance Engineer

LiteLLM

Location

San Francisco

Type

Full Time

Salary

USD 200,000 – 270,000

Level

Senior

Role

Site Reliability Engineer

Posted

Mar 4, 2026

The role

Summary

LiteLLM is seeking a Founding Reliability & Performance Engineer as its first dedicated reliability hire, responsible for the stability and performance of its AI infrastructure platform. The ideal candidate will own production reliability, performance engineering, and observability for a high-impact open-source project used by organizations such as NASA, Netflix, and Adobe.

What you'll do

Production Reliability: Manage on-call rotations, handle incident response, conduct blameless post-mortems, and support enterprise customer escalations

Performance Optimization: Detect and prevent memory leaks, optimize hot paths, establish performance benchmarks, and ensure system performance under high load

Observability: Implement structured logging, distributed tracing, develop accurate Prometheus metrics, and define/track SLOs for enterprise customers

Release Management: Build canary deployment strategies, implement automated rollback mechanisms, and ensure release safety

What we look for

Technical

Python Production Experience: At least 2 years running Python services in production, with experience debugging at scale

Async Python: Deep understanding of the asyncio event loop, session management, and connection pooling

Database Performance: Expertise in PostgreSQL connection pool tuning and query optimization

Education

Computer Science: Bachelor's degree in Computer Science, Software Engineering, or a related field preferred

Experience

On-Call Experience: Previous experience with production on-call rotations

Scalability: Demonstrated ability to handle performance challenges in distributed systems

Skills

Required skills

Python: Extensive experience with Python production services, especially async Python internals

Kubernetes: Operational-level knowledge of Kubernetes, pod lifecycle, and resource management

PostgreSQL: Advanced understanding of database performance, connection pool tuning, and query optimization

Performance Engineering: Expertise in identifying and resolving performance bottlenecks, memory leaks, and latency issues

Nice to have

Infrastructure Experience: Background in production engineering at companies like Meta, Cloudflare, Fastly, or similar infrastructure companies

Open Source: Contributions to open-source infrastructure projects

Async Programming: Deep understanding of HTTP/2, streaming responses, and async concurrency patterns

Compensation & benefits

Salary

USD 200,000 – 270,000 (annual)

Stock options

Available

Benefits

Equity

Meaningful startup equity at a high-growth stage

Impact

Opportunity to define reliability practices for a critical AI infrastructure platform

Visibility

Contributions visible to the entire AI infrastructure community

Open Source

Work on a project with 36K GitHub stars

Interview process

1. Initial Screening: Technical resume review and initial phone/video call
2. Technical Interview: Deep dive into performance engineering, Python async, and system reliability experience
3. System Design Challenge: Evaluation of the candidate's approach to solving complex reliability and performance problems
4. On-Site/Final Interview: Meet with the engineering team, discuss potential contributions and role alignment

LiteLLM

LiteLLM is a platform that provides a unified interface for accessing multiple large language models (LLMs) from different providers. The company offers tools and infrastructure that enable developers and organizations to seamlessly integrate various AI models into their applications while managing costs, performance, and deployment complexity. LiteLLM operates in the artificial intelligence and developer tools market, serving businesses that need flexible access to different language models without being locked into a single provider's ecosystem.

litellm.ai

Tech Stack

Languages

Python

Frameworks

asyncio, aiohttp

Databases

PostgreSQL, Redis

Tools

Kubernetes, Prometheus, Distributed Tracing

Other

Memray, py-spy, tracemalloc
