Member of Technical Staff - Reliability Engineering

Modal, 5 months ago

Location: New York

Type: Full Time

Salary: USD 150,000 – 350,000

Level: Senior

Role: Site Reliability Engineer

Posted: Jan 19, 2026

The role

Summary

Modal is seeking a systems-focused Reliability Engineer to be its first dedicated reliability hire, responsible for defining and implementing critical infrastructure reliability strategies for its AI cloud platform. The ideal candidate will be a deep systems thinker with extensive production experience, capable of improving system resilience, designing operational processes, and driving a reliability culture across the engineering organization.

What you'll do

Reliability Architecture: Identify and implement architectural improvements to enhance system reliability, performance, and availability of Modal's cloud infrastructure.

Operational Process Design: Design and implement critical operational processes including deployments, upgrades, rollbacks, and comprehensive postmortem reviews.

Monitoring and Observability: Build robust monitoring systems to ensure high-quality service delivery and proactively identify potential system issues.

Production Incident Management: Participate in on-call rotations, respond to production incidents, and debug complex issues across all service levels and stack components.

Reliability Culture Development: Foster a strong culture of reliability and system resilience across Modal's engineering organization.

What we look for

Technical

Cloud Infrastructure: Deep expertise in cloud technologies, with a strong preference for AWS and hyperscaler cloud platforms

Kubernetes Management: Experience scaling Kubernetes clusters to thousands of nodes and managing complex container orchestration environments

Systems Design: Advanced understanding of systems safety research, control theory, and large-scale distributed system architectures

Education

Computer Science or Engineering: Bachelor's or Master's degree in Computer Science, Software Engineering, or a related technical field preferred

Experience

Production Engineering: 5+ years of experience writing high-quality production code in complex cloud environments

On-Call Experience: Minimum 2 years of on-call experience managing critical production services

Skills

Required skills

Cloud Infrastructure: Expertise in cloud platforms, particularly AWS, with a strong understanding of infrastructure management

Systems Programming: Advanced systems programming skills, with the ability to write high-performance, reliable production code

Incident Response: Proven ability to diagnose and resolve complex production issues across multiple system layers

Nice to have

Systems Safety Research: Background or experience with STAMP (Systems-Theoretic Accident Model and Processes) and control theory

Capacity Planning: Experience with auto-scaling, fleet management, and large-scale infrastructure capacity planning

Compensation & benefits

Salary: USD 150,000 – 350,000 (annual)

Stock options: Available

Benefits

Equity Compensation: Stock options in a high-growth AI infrastructure startup valued at $1.1B

Career Growth: Opportunity to be the first reliability-focused hire and shape the company's reliability practices

Innovative Work Environment: Join a team of open-source creators, academic researchers, and experienced engineering leaders

Interview process

1. Initial Screening: Technical resume review and initial recruiter phone screen
2. Technical Interview: In-depth technical discussion focusing on systems reliability, cloud infrastructure, and problem-solving skills
3. Systems Design Challenge: Architectural design exercise demonstrating the candidate's approach to reliability and scalability
4. On-Site/Virtual Interviews: Multiple interviews with engineering leadership and potential team members
5. Final Interview: Meeting with senior leadership to discuss role alignment and cultural fit

More Jobs at Modal

10 other open positions

Systems Engineering Manager (Stockholm, Manager)

Infrastructure Security Engineer (New York, Senior)

Forward Deployed Engineer - ML (Stockholm, Senior)

Forward Deployed Engineer - ML (New York, Senior)

Systems Engineering Manager (New York, Manager)

Modal

Modal is a cloud computing platform that enables developers to build, deploy, and scale applications without managing underlying infrastructure. The company provides serverless computing solutions that allow developers to run code on demand, handle background jobs, and deploy machine learning models with minimal operational overhead. Modal serves software developers and organizations looking to streamline their application development and deployment processes in the cloud.

San Francisco, USA | Founded 2021 | modal.com

Tech Stack

Languages: Python

Frameworks: Kubernetes

Databases: Distributed Databases

Tools: AWS, Monitoring Tools

Other: Infrastructure as Code
