Staff Site Reliability Engineer, Release Engineering

Plaid, 1 week ago

Location

New York City Office

Type

Full Time

Salary

USD 207,600 – 273,600

Level

Staff

Role

Site Reliability Engineer

Posted

Jun 19, 2026

The role

Summary

Staff Site Reliability Engineer on Plaid's Infrastructure team, focusing on Release Engineering. This hands-on technical leadership role involves architecting SLO and error-budget frameworks, driving progressive delivery adoption, and ensuring production readiness across product engineering. You'll design and scale reliability practices, build self-service platform tooling, and lead incident response while preparing systems for an AI-driven development landscape. The role requires 8+ years in backend systems, SRE, or platform engineering, with proven expertise in reliability program design and canary deployment systems.

What you'll do

Architect and Manage SLO and Error-Budget Framework: Design and implement comprehensive Service Level Objectives (SLO) and error-budget programs that empower engineering teams to leverage reliability data for informed product and release decisions. Establish metrics-driven governance structures that balance innovation velocity with production stability across all product teams.

Lead Reliability Standards Expansion: Define and scale Plaid's reliability practices across all product engineering organizations. Convert foundational infrastructure investments into lasting operational habits, documentation, and cultural shifts that embed production-readiness thinking into team workflows and decision-making processes.

Drive Progressive Delivery and Automated Safety Gates Adoption: Promote widespread adoption of progressive delivery patterns, including canary rollouts, metric-gated analysis, and automated rollback mechanisms. Design intuitive tooling and self-service platforms that enable teams to maintain high development velocity without compromising production safety or user experience.

Guide Product Teams Toward Production Readiness: Partner with emerging product teams to ensure production readiness through hands-on expertise in observability, incident response procedures, and scalable deployment health practices. Provide technical mentorship and establish maturity assessment frameworks that guide teams through reliability milestones.

Lead Critical Incident Response and Platform Improvements: Direct response efforts during critical production incidents, ensuring minimal customer impact and rapid resolution. Facilitate comprehensive post-mortem analysis processes and translate findings into permanent platform improvements, preventive tooling, and architectural enhancements.

Cross-Functional Platform Collaboration: Collaborate with SRE, Platform Engineering, and Infrastructure teams to translate complex production requirements into intuitive, user-friendly platform features and tooling. Serve as a bridge between operational needs and platform capabilities, ensuring solutions address real-world deployment and reliability challenges.

Prepare Systems for AI-Driven Development Velocity: Architect and scale safety nets, deployment systems, and reliability infrastructure to handle increased volume and frequency of code changes driven by AI-assisted development tools. Establish guardrails and automated checks that maintain production stability as development velocity accelerates.

What we look for

Technical

Canary Rollout and Progressive Delivery Systems: Direct hands-on experience building or operating canary deployment systems, metric-gated analysis pipelines, or automated rollback infrastructure in production environments at scale.

Service Level Objectives and Reliability Frameworks: Proven expertise designing and implementing reliability programs such as service maturity models, SLI frameworks, or error-budget systems that achieved measurable cross-team adoption and cultural impact.

Systems Programming and Backend Development: Strong technical proficiency in backend systems development, with demonstrated expertise in Go or similar systems languages. Ability to author and review production-grade infrastructure code.

Kubernetes and Container Orchestration: Solid experience with Kubernetes architecture, container deployment patterns, and orchestration best practices in cloud-native environments.

Observability and Monitoring Stack: Deep expertise with Prometheus, time-series databases, and observability platforms. Ability to design comprehensive monitoring, alerting, and tracing strategies for complex distributed systems.

Service Mesh and Advanced Networking: Prior exposure to service mesh technologies and their role in enabling progressive delivery, traffic management, and reliability patterns in microservice architectures.

Infrastructure as Code and GitOps: Hands-on experience with ArgoCD, Terraform, or similar infrastructure-as-code and GitOps tools for managing declarative, version-controlled infrastructure and deployments.

Education

Bachelor's Degree in Computer Science or Related Field: Formal education in Computer Science, Software Engineering, Mathematics, or related technical discipline, or equivalent professional experience demonstrating advanced systems thinking.

Experience

Senior Backend or Platform Engineering: Minimum 8 years of professional experience in backend systems development, Site Reliability Engineering (SRE), or platform engineering roles, with progressive responsibility and impact on infrastructure scale.

Production Reliability and Incident Management: Substantial experience designing and operating production systems with proven expertise in incident response, root cause analysis, postmortem processes, and translating incidents into platform improvements.

Organizational Influence and Change Leadership: Demonstrated ability to drive organizational change and influence engineering culture without formal authority. Track record of getting buy-in from multiple engineering teams for reliability initiatives and practices.

Skills

Required skills

Go Programming: Advanced proficiency in Go for systems programming, with ability to design and implement high-performance, concurrent infrastructure components.

Kubernetes Architecture: Deep understanding of Kubernetes deployment models, networking, storage, and operational patterns at enterprise scale.

Prometheus and Time-Series Monitoring: Expert-level knowledge of Prometheus architecture, metric design, and time-series analysis for building effective observability solutions.

Canary Deployment Systems: Hands-on experience with canary rollout patterns, progressive delivery orchestration, and metric-gated deployment safety mechanisms.

SLO and SLI Design: Expertise in defining Service Level Objectives, Service Level Indicators, and error budgets that align with business objectives and technical capabilities.

Incident Command and Post-Mortem Facilitation: Mastery of incident management processes, blameless postmortem facilitation, and translating incident learnings into actionable improvements.

Technical Leadership and Influence: Ability to influence engineering culture, drive adoption of reliability practices, and lead technical initiatives across multiple teams without formal authority.

Nice to have

Service Mesh Technologies (Istio, Envoy): Experience with service mesh platforms for traffic management, observability integration, and enabling progressive delivery patterns in microservice environments.

ArgoCD and GitOps Workflows: Hands-on experience implementing GitOps deployment patterns and continuous delivery pipelines using declarative infrastructure approaches.

Distributed Systems Design: Understanding of distributed systems principles, consensus algorithms, and patterns relevant to building resilient infrastructure platforms.

Policy as Code and OPA/Conftest: Experience with policy-as-code frameworks for automating compliance checks, deployment guardrails, and infrastructure validation across organizations.

Terraform and Infrastructure as Code: Proficiency with Terraform for managing complex multi-cloud or multi-region infrastructure deployments in a reproducible, version-controlled manner.

Cost Optimization and Resource Management: Track record of designing systems that optimize cloud infrastructure costs while maintaining performance and reliability standards.

FinTech or Payment Systems: Prior experience working in financial technology, payment systems, or regulated industries where reliability and audit requirements are critical.

Compensation & benefits

Salary

USD 207,600 – 273,600 (annual)

Stock options

Available

Benefits

Comprehensive Health Coverage

Medical, dental, and vision insurance plans covering employee and family members with competitive deductibles and out-of-pocket maximums.

401(k) Retirement Plan

Company-sponsored 401(k) retirement savings plan with employer matching contributions to support long-term financial planning.

Equity and Stock Options

Participation in company equity programs and stock options aligned with company performance, providing wealth-building opportunities for eligible employees.

Flexible Work Arrangements

Flexible remote work options and location flexibility for engineering teams, enabling work-life balance and access to distributed talent.

Professional Development

Learning budgets, conference attendance support, and professional development opportunities to expand technical skills and industry knowledge.

Paid Time Off

Generous paid time off policies including vacation days, sick leave, and company holidays to support employee wellbeing and work-life balance.

Diversity and Inclusion Programs

Commitment to building a diverse workforce with employee resource groups, mentorship programs, and inclusive hiring practices.

Financial Wellness Programs

Employee assistance programs and financial planning resources, including expertise in fintech benefits given Plaid's industry focus.

Interview process

1. Initial Screening and Experience Review: Recruiter conducts initial phone screening to assess background in SRE, platform engineering, or backend systems. Focus on verifying professional experience level, familiarity with release engineering practices, and alignment with Staff-level expectations.
2. Technical Architecture Discussion: First-round conversation with current SRE or Infrastructure team members exploring specific projects, reliability program design decisions, and technical approach to solving deployment challenges. Discussion centers on canary systems, SLO design, and production incident examples.
3. Systems Design Interview: Detailed technical interview focusing on designing a production-grade progressive delivery system or SLO framework. Candidates work through trade-offs in monitoring, rollback strategies, and handling high-velocity deployments while maintaining safety.
4. Behavioral and Leadership Interview: Conversation with Engineering Manager or Infrastructure leadership exploring organizational influence, change management approach, incident response philosophy, and mentorship style. Emphasis on driving adoption across skeptical teams and handling high-pressure incidents.
5. Fintech Domain and Culture Fit Discussion: Optional conversation exploring fintech experience, understanding of Plaid's mission around financial inclusion, and cultural alignment with Plaid's principles of inventing tomorrow and embracing openness.
6. Offer and Compensation Discussion: HR and hiring manager finalize offer including base salary, equity package, and benefits. Discussion covers relocation assistance if needed, start date, and any accommodations required for the onboarding process.