LiteLLM

Site Reliability Engineer

Location

San Francisco

Type

Full Time

Salary

USD 150,000 – 200,000

Level

Mid

Role

Site Reliability Engineer

Posted

Mar 4, 2026

The role

Summary

LiteLLM is seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance of its AI Gateway platform, which is used by major companies such as Adobe and Netflix. The ideal candidate will own critical production infrastructure and address complex system challenges in a high-impact, open-source environment.

What you'll do

Production Reliability: Ensure high availability and performance of the LiteLLM proxy handling millions of LLM requests
System Optimization: Diagnose and resolve complex system issues including OOM problems, database connection challenges, and race conditions
Performance Monitoring: Implement and maintain Prometheus metrics, improve observability, and create robust alerting mechanisms
Infrastructure Resilience: Develop self-healing mechanisms for the proxy, including graceful degradation and connection retry logic
Technical Debugging: Resolve intricate technical challenges like memory leaks, connection pool exhaustion, and caching synchronization issues

What we look for

Technical

Production Python Experience: 1-4 years of experience running Python services at scale
System Debugging: Proven ability to troubleshoot OOM issues, memory leaks, and connection pool problems
Database Expertise: Advanced knowledge of PostgreSQL query optimization and connection pooling

Education

Computer Science: Bachelor's degree in Computer Science, Software Engineering, or related technical field preferred

Experience

Kubernetes Management: Hands-on experience with pod restarts, resource limits, and multi-replica coordination
Monitoring Tools: Proficiency with Prometheus/Grafana for comprehensive system monitoring

Skills

Required skills

Python: Core programming language for the platform
Kubernetes: Container orchestration and deployment management
PostgreSQL: Database management and query optimization
Redis: In-memory data structure store for caching and session management
Prometheus: Monitoring and alerting system for production environments

Nice to have

FastAPI: Web framework for building APIs
Docker: Containerization platform
Performance Profiling: Ability to analyze and optimize system performance

Compensation & benefits

Salary

USD 150,000 – 200,000 (annual)

Stock options

Available

Benefits

Open Source Contribution

Opportunity to work on and contribute to popular open-source projects

Cutting-edge AI Technology

Work with advanced AI infrastructure used by leading global companies

Direct Impact

Work closely with CEO and CTO on critical technical challenges


Interview process

  1. Initial Screening: Phone or video call with the recruiting team to assess basic qualifications
  2. Technical Interview: In-depth discussion of system reliability, debugging experiences, and technical problem-solving
  3. Systems Design Challenge: Evaluation of the candidate's approach to complex infrastructure and reliability challenges
  4. Team Fit Interview: Meeting with the current engineering team to assess collaboration and cultural alignment
