Senior Platform Engineer

Apollo GraphQL, 3 days ago

Location

United States or Canada (remote)

Workplace

Remote

Type

Full Time

Salary

USD 165,000 – 195,000

Level

Senior

Role

Platform Engineer

Posted

Jun 30, 2026

The role

Summary

Apollo GraphQL, a leader in GraphQL technology, is seeking a Senior Platform Engineer with distributed systems expertise to drive platform engineering excellence. The role focuses on infrastructure automation, Kubernetes service delivery, SLO-driven reliability, and cross-organizational leadership within a high-performing team that values elegant solutions and operational excellence.

What you'll do

Lead Cross-Organizational Platform Initiatives: Take ownership of medium- to large-impact subsystems and projects, driving infrastructure modernization efforts across Apollo GraphQL. Execute strategic platform engineering initiatives that span multiple teams, ensuring successful delivery of scalable distributed systems solutions.

Design Infrastructure and Service Delivery Architecture: Create forward-thinking technical designs for Kubernetes-based service delivery, Terraform infrastructure-as-code management, and CI/CD pipelines. Proactively address cost efficiency, security posture, and observability in all architectural decisions using data-driven approaches.

Automate Infrastructure Operations and DevOps Workflows: Build and enhance automation frameworks across container orchestration, infrastructure provisioning, and deployment pipelines. Leverage tools like ArgoCD, Atlantis, and CircleCI to eliminate manual processes and reduce operational overhead.

Drive Developer Velocity and Operational Excellence: Implement platform capabilities that accelerate developer productivity while maintaining high reliability standards. Utilize DORA metrics and observability frameworks to measure and continuously improve platform performance, reducing deployment friction and incident response times.

Deliver Technical Documentation and Design Reviews: Produce comprehensive technical artifacts including design documents, one-pagers, decision records (DRs), and operational runbooks. Participate in design review processes to ensure architectural decisions align with company standards and future scalability requirements.

Lead On-Call Operations and Production Support: Participate in on-call rotations and take full ownership of incident resolution, root cause analysis, and prevention. Identify and resolve systemic reliability issues, eliminating noisy monitoring alerts through proactive remediation and infrastructure hardening.

Conduct Technical Interviews and Mentor Engineers: Participate in recruiting efforts by conducting technical interviews for platform engineering candidates. Mentor and guide team members on distributed systems concepts, infrastructure best practices, and operational reliability patterns.

Collaborate Across Teams on Platform Capabilities: Work with internal stakeholders to understand requirements for logging, monitoring, deployment, and infrastructure services. Build consensus on platform direction and foster a collaborative culture where platform improvements benefit the entire organization.

What we look for

Technical

Distributed Systems Architecture: Deep expertise designing and operating stateless, fault-tolerant systems, with an understanding of eventual consistency models, event-driven architectures, and asynchronous patterns. Proven ability to reason about system behavior under failure conditions and design for high availability.

Kubernetes Container Orchestration: Production-grade experience operating and optimizing Kubernetes clusters, including workload management, resource allocation, networking, and troubleshooting. Understanding of declarative infrastructure patterns and cluster operations at scale.

Infrastructure-as-Code and Terraform: Proficiency with Terraform for managing cloud infrastructure across multiple environments. Experience writing modular, maintainable IaC that follows best practices for state management and reproducibility.

CI/CD Pipeline Design and Automation: Experience designing and implementing continuous integration and deployment pipelines. Familiarity with GitOps principles, deployment automation tools, and strategies for managing infrastructure and application deployments safely at scale.

Cloud Platform Operations: Strong working knowledge of Google Cloud Platform (GCP), with transferable expertise applicable to AWS or Azure. Understanding of compute, networking, storage services, and cloud-native operational patterns.

Observability and Monitoring Architecture: Experience implementing comprehensive observability solutions including logging, metrics collection, distributed tracing, and alerting. Proficiency with monitoring tools like DataDog and understanding of SLO-driven reliability practices.

AI-Assisted Development and Production Intelligence: Demonstrated ability to use agentic tooling and AI systems to enhance daily engineering workflows and to detect, diagnose, and mitigate production issues. Experience using automation to improve operational efficiency and system reliability.

Education

Bachelor's Degree in Computer Science, Engineering, or Related Field: Strong foundation in computer science fundamentals, distributed systems theory, and software engineering principles. Equivalent professional experience demonstrating mastery of these core concepts may substitute for a formal degree.

Experience

Senior-Level Platform or Infrastructure Engineering: 5+ years of progressive experience in platform engineering, site reliability engineering (SRE), or infrastructure operations roles. Demonstrated track record of owning critical systems, leading architectural decisions, and driving organizational platform improvements.

Cross-Team Leadership and Collaboration: Proven ability to work effectively across organizational boundaries, building consensus on platform direction and aligning diverse stakeholder interests. Experience mentoring junior engineers and contributing to technical hiring decisions.

Operational Incident Management: Substantial on-call experience responding to and resolving production incidents. Track record of performing effective root cause analysis, implementing permanent fixes, and eliminating systemic reliability issues through infrastructure improvements.

Data-Driven Decision Making: Strong track record of using metrics, observability data, and analytics to drive technical and business decisions. Experience applying frameworks like DORA to measure and improve engineering effectiveness and operational performance.

Technical Design and Documentation: Experience writing technical designs, architectural decision records, and operational documentation. Ability to communicate complex infrastructure concepts to both technical and non-technical audiences through clear written and verbal communication.

Skills

Required skills

Kubernetes: Production-grade expertise with container orchestration, cluster operations, workload management, and troubleshooting.

Terraform: Proficiency with infrastructure-as-code for provisioning and managing cloud resources across multiple environments.

Google Cloud Platform (GCP): Strong working knowledge of GCP services, cloud architecture patterns, and cloud-native operations.

Distributed Systems Design: Deep understanding of distributed computing patterns, eventual consistency, fault tolerance, and event-driven architectures.

CI/CD Pipelines: Experience designing and implementing continuous integration and continuous deployment workflows and automation.

System Observability: Expertise in implementing monitoring, logging, metrics collection, and alerting for complex distributed systems.

On-Call Operations: Production incident response, root cause analysis, and implementation of permanent reliability fixes.

Technical Architecture and Design: Ability to design scalable, maintainable systems and communicate architectural decisions through technical documentation.

Nice to have

Helm: Package management and templating for Kubernetes applications, enabling standardized deployments across environments.

ArgoCD: GitOps-based continuous delivery tool for Kubernetes, enabling declarative application deployment workflows.

Atlantis: Infrastructure-as-code workflow tool that enables collaborative infrastructure management and GitOps practices.

CircleCI: Continuous integration platform for automated testing and deployment pipeline orchestration.

DataDog: Enterprise observability platform for monitoring, logging, and analytics across distributed systems.

Docker: Container runtime and containerization best practices for packaging applications and services.

GraphQL: Experience with GraphQL APIs and understanding of Apollo GraphQL's role in the GraphQL ecosystem.

AI/ML-Powered Tooling: Experience with agentic systems and AI tools for automating operations, diagnostics, and incident response workflows.

Compensation & benefits

Salary

USD 165,000 – 195,000 (annual)

Stock options

Available

Benefits

Equity and Stock Options

Competitive equity package providing ownership stake in Apollo GraphQL's future success.

Health Insurance

Comprehensive medical, dental, and vision coverage for employee and family.

Retirement Planning

401(k) retirement plan with employer contribution matching.

Professional Development

Education budget for conferences, courses, and professional growth in platform engineering and distributed systems.

Flexible Work Arrangement

Remote-first work environment enabling geographic flexibility and work-life balance.

Unlimited Paid Time Off

Flexible PTO policy enabling engineers to manage personal time and wellness needs.

Parental Leave

Paid leave for new parents supporting family growth and work-life integration.

Interview process

1. Initial Screening Call: 30-minute conversation with the recruiting team to assess background, career goals, and alignment with platform engineering at Apollo GraphQL.
2. Technical Deep Dive Interview: 60- to 90-minute session with platform engineering team members covering distributed systems concepts, Kubernetes architecture, infrastructure design patterns, and real-world problem-solving scenarios.
3. System Design Discussion: Technical interview focusing on designing scalable infrastructure solutions and addressing trade-offs between cost, reliability, and observability. May include infrastructure-as-code design or Kubernetes architecture scenarios.
4. Cross-Functional Collaboration Interview: Conversation with team members outside platform engineering to assess collaboration style, communication ability, and cross-team impact potential.
5. Leadership and Culture Fit Interview: Discussion with senior engineering leadership to explore mentorship philosophy, decision-making approach, and alignment with Apollo's engineering culture, which emphasizes humility and mindfulness.
6. Final Offer Discussion: Detailed conversation with the hiring manager covering role expectations, growth opportunities, and compensation.

Apollo GraphQL

Apollo GraphQL develops open-source GraphQL tools and a leading GraphQL implementation platform, empowering teams to build, query, and manage APIs efficiently.

Remote, USA | Founded 2015 | apollographql.com

Tech Stack

Languages

Go, Python, Bash/Shell, YAML

Frameworks

Kubernetes, Helm, ArgoCD

Databases

PostgreSQL, Cloud Datastore/Firestore

Tools

Terraform, Docker, ArgoCD, Atlantis, CircleCI, DataDog, Git/GitHub

Other

Google Cloud Platform (GCP), SLO/SLI/SLA Framework, DORA Metrics, GitOps, Observability and Monitoring
