Site Reliability Engineer

LiteLLM, 4 weeks ago

Location

San Francisco

Type

Full Time

Salary

USD 150,000 – 200,000

Level

Mid

Role

Site Reliability Engineer

Posted

Mar 4, 2026

The role

Summary

LiteLLM is seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance of its AI Gateway platform, which is used by companies such as Adobe and Netflix. The ideal candidate will own critical production infrastructure and tackle complex systems problems in a high-impact, open-source environment.

What you'll do

Production Reliability: Ensure high availability and performance of the LiteLLM proxy handling millions of LLM requests

System Optimization: Diagnose and resolve complex system issues including OOM problems, database connection challenges, and race conditions

Performance Monitoring: Implement and maintain Prometheus metrics, improve observability, and create robust alerting mechanisms

Infrastructure Resilience: Develop self-healing mechanisms for the proxy, including graceful degradation and connection retry logic

Technical Debugging: Resolve intricate technical challenges like memory leaks, connection pool exhaustion, and caching synchronization issues

What we look for

Technical

Production Python Experience: 1-4 years of experience running Python services at scale

System Debugging: Proven ability to troubleshoot OOM issues, memory leaks, and connection pool problems

Database Expertise: Advanced knowledge of PostgreSQL query optimization and connection pooling

Education

Computer Science: Bachelor's degree in Computer Science, Software Engineering, or a related technical field preferred

Experience

Kubernetes Management: Hands-on experience with pod restarts, resource limits, and multi-replica coordination

Monitoring Tools: Proficiency with Prometheus/Grafana for comprehensive system monitoring

Skills

Required skills

Python: Core programming language for the platform

Kubernetes: Container orchestration and deployment management

PostgreSQL: Database management and query optimization

Redis: In-memory data structure store for caching and session management

Prometheus: Monitoring and alerting system for production environments

Nice to have

FastAPI: Web framework for building APIs

Docker: Containerization platform

Performance Profiling: Ability to analyze and optimize system performance

Compensation & benefits

Salary

USD 150,000 – 200,000 (annual)

Stock options

Available

Benefits

Open Source Contribution

Opportunity to work on and contribute to popular open-source projects

Cutting-edge AI Technology

Work with advanced AI infrastructure used by leading global companies

Direct Impact

Work closely with CEO and CTO on critical technical challenges

Interview process

1. Initial Screening: Phone or video call with the recruiting team to assess basic qualifications
2. Technical Interview: In-depth discussion of system reliability, debugging experience, and technical problem-solving
3. Systems Design Challenge: Evaluation of the candidate's approach to complex infrastructure and reliability challenges
4. Team Fit Interview: Meeting with the current engineering team to assess collaboration and cultural alignment

More Jobs at LiteLLM

5 other open positions

Senior Backend Engineer: San Francisco, Senior

Backend Engineer (New Grad): Mid

Forward Deployed Engineer: San Francisco, Senior

Founding Reliability & Performance Engineer: San Francisco, Senior

LiteLLM

LiteLLM is a platform that provides a unified interface for accessing multiple large language models (LLMs) from different providers. The company offers tools and infrastructure that enable developers and organizations to seamlessly integrate various AI models into their applications while managing costs, performance, and deployment complexity. LiteLLM operates in the artificial intelligence and developer tools market, serving businesses that need flexible access to different language models without being locked into a single provider's ecosystem.

litellm.ai

Tech Stack

Languages

Python

Frameworks

FastAPI

Databases

PostgreSQL, Redis

Tools

Kubernetes, Prometheus, Docker

Other

Prisma ORM
