Modal

Member of Technical Staff - Reliability Engineering

Location

New York

Type

Full Time

Salary

USD 150,000 – 350,000

Level

Senior

Role

Site Reliability Engineer

Posted

Jan 19, 2026


The role

Summary

Modal is seeking a systems-focused Reliability Engineer to be its first dedicated reliability hire, responsible for defining and implementing infrastructure reliability strategies for its AI cloud platform. The ideal candidate is a deep systems thinker with extensive production experience who can improve system resilience, design operational processes, and drive a reliability culture across the engineering organization.

What you'll do

Reliability Architecture: Identify and implement architectural improvements to enhance system reliability, performance, and availability of Modal's cloud infrastructure.
Operational Process Design: Design and implement critical operational processes including deployments, upgrades, rollbacks, and comprehensive postmortem reviews.
Monitoring and Observability: Build robust monitoring systems to ensure high-quality service delivery and proactively identify potential system issues.
Production Incident Management: Participate in on-call rotations, respond to production incidents, and debug complex issues across all service levels and stack components.
Reliability Culture Development: Foster a strong culture of reliability and system resilience across Modal's engineering organization.

What we look for

Technical

Cloud Infrastructure: Deep expertise in cloud technologies, with a strong preference for AWS and other hyperscaler cloud platforms
Kubernetes Management: Experience scaling Kubernetes clusters to thousands of nodes and managing complex container orchestration environments
Systems Design: Advanced understanding of systems safety research, control theory, and large-scale distributed system architectures

Education

Computer Science or Engineering: Bachelor's or Master's degree in Computer Science, Software Engineering, or a related technical field preferred

Experience

Production Engineering: 5+ years of experience writing high-quality production code in complex cloud environments
On-Call Experience: Minimum 2 years of on-call experience managing critical production services

Skills

Required skills

Cloud Infrastructure: Expertise in cloud platforms, particularly AWS, with a strong understanding of infrastructure management
Systems Programming: Advanced systems programming skills with the ability to write high-performance, reliable production code
Incident Response: Proven ability to diagnose and resolve complex production issues across multiple system layers

Nice to have

Systems Safety Research: Background or experience with STAMP (Systems-Theoretic Accident Model and Processes) and control theory
Capacity Planning: Experience with auto-scaling, fleet management, and large-scale infrastructure capacity planning

Compensation & benefits

Salary

USD 150,000 – 350,000 (annual)

Stock options

Available

Benefits

Equity Compensation

Stock options in a high-growth AI infrastructure startup valued at $1.1B

Career Growth

Opportunity to be the first reliability-focused hire and shape the company's reliability practices

Innovative Work Environment

Join a team of open-source creators, academic researchers, and experienced engineering leaders


Interview process

  1. Initial Screening: Technical resume review and initial recruiter phone screen
  2. Technical Interview: In-depth technical discussion focusing on systems reliability, cloud infrastructure, and problem-solving skills
  3. Systems Design Challenge: Architectural design exercise demonstrating the candidate's approach to reliability and scalability
  4. On-Site/Virtual Interviews: Multiple interviews with engineering leadership and potential team members
  5. Final Interview: Meeting with senior leadership to discuss role alignment and cultural fit
