Cohere

Staff Software Engineer, GPU Infrastructure (HPC)

Location

Canada

Type

Full Time

Salary

CAD 180,000 – 250,000

Level

Staff

Role

Staff Software Engineer

Posted

Jan 15, 2026

The role

Summary

Cohere is seeking a Staff Software Engineer for its GPU Infrastructure team, focusing on building and scaling high-performance computing (HPC) infrastructure to support cutting-edge AI model training. The role involves designing, optimizing, and managing Kubernetes-based GPU superclusters across multiple clouds, with a critical focus on performance, reliability, and enabling AI research workflows.

What you'll do

HPC Infrastructure Development: Design and deploy Kubernetes-based GPU/TPU superclusters across multiple cloud environments, ensuring high-throughput and low-latency performance for AI workloads.
Performance Optimization: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, using high-performance networking and communication technologies such as RDMA and NCCL.
Infrastructure Troubleshooting: Proactively identify and resolve infrastructure bottlenecks, performance issues, and system failures to minimize disruption to AI/ML research workflows.
Research Enablement: Create self-service tools and intuitive interfaces that allow AI researchers to independently monitor, debug, and optimize their training jobs.
Innovation Leadership: Work closely with AI researchers to understand emerging technological needs and translate them into robust, scalable infrastructure solutions.
Best Practices Advocacy: Champion observability, automation, and infrastructure-as-code (IaC) practices across the organization to ensure system maintainability and resilience.

What we look for

Technical

ML/HPC Infrastructure: Extensive experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing environments
Kubernetes Expertise: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters at scale, specifically for AI workloads
Systems Programming: Advanced proficiency in Python for ML tooling and Go for systems engineering; open-source contributions preferred

Education

Advanced Degree: Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field preferred

Experience

Infrastructure Scale: Demonstrated experience managing large-scale, distributed computing environments in AI and machine learning contexts
Research Collaboration: Track record of working closely with AI researchers or ML engineers to solve complex infrastructure challenges

Skills

Required skills

Kubernetes: Advanced configuration and management of Kubernetes clusters for AI workloads
Python: Proficient programming for ML tooling and infrastructure development
Go: Systems programming for infrastructure and performance-critical applications

Nice to have

RDMA Networking: Experience with high-performance networking technologies
Linux Internals: Deep understanding of Linux system architecture and performance optimization

Compensation & benefits

Salary

CAD 180,000 – 250,000 (annual)

Stock options

Available

Benefits

Health Benefits

Comprehensive health and dental coverage with additional mental health budget

Parental Leave

100% salary top-up for up to 6 months of parental leave

Vacation

6 weeks (30 working days) of annual vacation

Work Flexibility

Remote-flexible work arrangement with co-working stipend

Personal Development

Enrichment benefits for arts, culture, fitness, and workspace improvement


Interview process

  1. Initial Screening: HR review of application and initial qualifications match
  2. Technical Phone Screen: Detailed discussion of technical background and infrastructure expertise
  3. Technical Interview: In-depth technical assessment of Kubernetes, HPC, and systems engineering skills
  4. Research Collaboration Interview: Evaluation of ability to work with AI researchers and solve complex infrastructure challenges
  5. Final Interview: Meeting with team leadership to assess cultural fit and long-term potential
