Sunday

System Software Engineer, Robot Platform — GPU & Accelerated Compute

Location

Redwood City, CA

Type

Full Time

Salary

USD 160,000 – 220,000

Level

Mid

Role

Backend Engineer

Posted

May 11, 2026


The role

Summary

This System Software Engineer role on Robot Platform focuses on GPU and accelerated compute systems for home robotics. You'll architect efficient GPU scheduling, model execution, and data transfer pipelines for real-time robotic inference and perception workloads. The position requires deep expertise in CUDA systems programming, GPU architecture optimization, and Linux kernel fundamentals to ensure the GPU operates as a first-class resource meeting latency and throughput requirements across concurrent robotics applications.

What you'll do

GPU Scheduling and Resource Arbitration: Design and implement GPU scheduling mechanisms including time-slicing and multi-process service (MPS) configurations to arbitrate GPU access across concurrent users such as model inference, SLAM, and robotics applications while maintaining predictable latency requirements for real-time systems.
Efficient Model Execution Framework: Reduce GPU kernel launch overheads and engineer fast, predictable model switching capabilities on the same device through kernel optimization and runtime abstraction layers to support dynamic workload switching in robotics applications.
GPU Memory and Data Transfer Optimization: Build efficient CPU-to-GPU data movement paths including pinned memory management, zero-copy transfer mechanisms, and asynchronous patterns. Optimize camera frame ingestion into GPU memory with hardware-accelerated encode/decode integration (NVDEC/NVENC).
CPU-GPU Synchronization Architecture: Design and implement synchronization primitives and communication patterns that minimize CPU-GPU stalls and keep inference pipelines operating at full utilization, ensuring seamless data flow between host and device.
Cross-Functional Platform Collaboration: Partner with ML, SLAM/Perception, Controls, and Hardware teams to validate GPU platform requirements, establish performance benchmarks, and ensure the GPU is treated as a first-class resource meeting end-to-end system latency and throughput constraints.
Performance Profiling and Optimization: Leverage GPU profiling tools including NVIDIA Nsight Systems and Nsight Compute to identify bottlenecks, measure kernel efficiency, and iteratively optimize accelerated compute workloads for real-time robotic performance.
Developer Infrastructure and Platform Tools: Contribute to build and delivery infrastructure that enables teams to rapidly develop, test, ship, and update robot software safely across the fleet while maintaining GPU resource stability and predictability.
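The data-transfer and synchronization responsibilities above follow a standard CUDA pattern: page-locked (pinned) host buffers plus asynchronous copies on a stream. A minimal sketch, assuming a single 1080p RGB frame and an illustrative `preprocess` kernel (all names and sizes here are hypothetical, not from the posting):

```cuda
// Sketch: overlapping a camera-frame upload with host work using
// pinned memory and a CUDA stream. Error checking omitted for brevity.
#include <cuda_runtime.h>

__global__ void preprocess(const unsigned char* frame, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = frame[i] / 255.0f;  // normalize pixel to [0, 1]
}

int main() {
    const int kFrameBytes = 1920 * 1080 * 3;  // one RGB camera frame
    unsigned char* h_frame;
    unsigned char* d_frame;
    float* d_out;

    // Pinned (page-locked) host memory is required for truly
    // asynchronous DMA transfers; pageable memory forces a staging copy.
    cudaMallocHost(&h_frame, kFrameBytes);
    cudaMalloc(&d_frame, kFrameBytes);
    cudaMalloc(&d_out, kFrameBytes * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy and kernel serialize with each other on the stream, but the
    // host thread returns immediately and can prepare the next frame.
    cudaMemcpyAsync(d_frame, h_frame, kFrameBytes,
                    cudaMemcpyHostToDevice, stream);
    const int threads = 256;
    const int blocks = (kFrameBytes + threads - 1) / threads;
    preprocess<<<blocks, threads, 0, stream>>>(d_frame, d_out, kFrameBytes);

    cudaStreamSynchronize(stream);  // block only when the result is needed

    cudaFreeHost(h_frame);
    cudaFree(d_frame);
    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return 0;
}
```

This is a sketch of the general technique, not the team's actual pipeline; a production path would also integrate NVDEC output buffers and per-camera streams.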

What we look for

Technical

CUDA Systems Programming: 2+ years of professional experience developing GPU systems software with deep proficiency in the CUDA programming model, runtime APIs, CUDA Graphs, and CUDA IPC for inter-process communication patterns.
Systems Language Expertise: Advanced proficiency in at least one systems programming language such as C++, C, or Rust for implementing low-latency, high-performance GPU scheduling and data movement systems.
GPU Architecture Knowledge: Solid understanding of modern GPU architectures, including compute capabilities, memory hierarchies, occupancy constraints, and tradeoffs between different GPU utilization strategies like MPS (Multi-Process Service) and MIG (Multi-Instance GPU).
Linux Systems Fundamentals: Deep knowledge of Linux kernel concepts, including process scheduling, inter-process communication (IPC), virtual and physical memory management, and performance tuning for real-time systems.
GPU Profiling and Performance Analysis: Hands-on experience with the NVIDIA profiling toolchain, including Nsight Systems for timeline analysis and Nsight Compute for kernel-level performance debugging and optimization.
Real-Time Systems Constraints: Experience architecting systems that meet strict latency and throughput requirements, understanding predictability trade-offs, and implementing deterministic scheduling patterns for robotics applications.
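CUDA Graphs, named above, are the usual tool for the launch-overhead reduction this role calls for: a sequence of kernel launches is captured once and then replayed with a single call per iteration. A hedged sketch under assumed sizes, with an illustrative `scale` kernel (not from the posting):

```cuda
// Sketch: stream capture into a CUDA graph to amortize launch overhead.
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a short pipeline of launches into a graph. Each replay of
    // the instantiated graph costs one launch, not one per kernel.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int step = 0; step < 4; ++step)
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Steady state: one replay per inference iteration.
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d_x);
    cudaStreamDestroy(stream);
    return 0;
}
```

In a real inference runtime the captured region would hold the model's kernel sequence, and graph updates would handle parameter changes between replays.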

Education

Bachelor's Degree in Computer Science or Related Field: BS/BA in Computer Science, Computer Engineering, Electrical Engineering, or equivalent practical experience demonstrating strong fundamentals in systems and distributed computing.

Experience

GPU Systems Development: 2+ years of production experience developing GPU system software, memory management layers, or scheduling systems that optimize for both performance and resource contention.
Robotics or Embedded Systems: Background developing systems-level software for robotics, autonomous vehicles, or embedded AI platforms where GPU acceleration was critical to real-time performance.
Performance-Critical Systems: Track record of shipping performance-optimized infrastructure for latency-sensitive applications, demonstrated through concrete examples of system optimization and measurement results.

Skills

Required skills

CUDA Programming: Expert-level CUDA development including memory management, kernel optimization, stream management, and understanding of warp-level operations and occupancy calculations.
C++ or C Programming: Advanced proficiency in modern C++ (C++17+) or C for systems programming, including template metaprogramming, memory management, and performance optimization techniques.
GPU Architecture: Deep understanding of GPU compute capabilities, SM/warp execution models, memory coalescing patterns, and latency vs. throughput optimization tradeoffs.
Linux Kernel Concepts: Strong fundamentals in Linux process scheduling, memory management (virtual/physical addressing, paging), IPC mechanisms, and system call interfaces.
Profiling and Debugging: Proficiency with Nsight Systems, Nsight Compute, and traditional Linux profiling tools (perf, strace) for diagnosing performance bottlenecks in GPU workloads.
Real-Time Systems Design: Ability to design systems with predictable latency bounds, understanding scheduling algorithms, priority inversion problems, and synchronization primitives for embedded real-time environments.
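The synchronization-primitive skills above often come down to one idiom: record a CUDA event when a transfer completes and have the compute stream wait on it on-device, so the host thread never stalls mid-pipeline. A minimal sketch with hypothetical stream names (nothing here is from the posting):

```cuda
// Sketch: cross-stream ordering with events instead of host-side syncs.
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    float *h_in, *d_in;
    cudaMallocHost(&h_in, n * sizeof(float));  // pinned for async copy
    cudaMalloc(&d_in, n * sizeof(float));

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    cudaEvent_t copyDone;
    // DisableTiming makes the event cheaper when only used for ordering.
    cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);

    // Upload on the copy stream and record completion.
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);
    cudaEventRecord(copyDone, copyStream);

    // The compute stream waits on the event *on the device*; the host
    // thread continues immediately instead of synchronizing here.
    cudaStreamWaitEvent(computeStream, copyDone, 0);
    // ... enqueue inference kernels on computeStream ...

    cudaStreamSynchronize(computeStream);  // block only at the end

    cudaEventDestroy(copyDone);
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFree(d_in);
    cudaFreeHost(h_in);
    return 0;
}
```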

Nice to have

CUDA Library Development: Contributions to CUDA libraries (cuBLAS, cuDNN, TensorRT) or open-source GPU programming frameworks demonstrating expertise in high-performance GPU abstraction layers.
Camera Pipeline Integration: Experience with camera sensor integration, image processing pipelines, and NVIDIA hardware accelerators like NVDEC (video decode) and NVENC (video encode).
Embedded GPU Optimization: Experience optimizing model inference and GPU workloads on embedded platforms such as NVIDIA Jetson boards, where resource constraints require careful power and thermal management.
Observability and Tracing: Experience implementing observability solutions and distributed tracing for GPU-accelerated workloads, understanding end-to-end latency measurement and critical path analysis.
ML Model Serving: Experience with ML inference frameworks (TensorRT, ONNX Runtime, TVM) and optimization techniques for deploying models on edge GPU platforms.
Rust Systems Programming: Proficiency in Rust for systems-level programming, particularly for memory safety-critical infrastructure where Rust's guarantees add engineering value.

Compensation & benefits

Salary

USD 160,000 – 220,000 (annual)

Stock options

Available

Benefits

Equity Stake

Stock options providing long-term upside alignment with Sunday's mission to make home robotics accessible to all households.

Health Insurance Coverage

Comprehensive medical, dental, and vision insurance for you and your family with competitive plan options.

Professional Development

Learning budget for courses, conferences, and technical training to stay current with GPU computing advancements and robotics technology.

Flexible Work Environment

Remote-friendly or flexible arrangements enabling work-life balance for senior technical contributors.

Cutting-Edge Technology

Access to state-of-the-art GPU hardware and robotics platforms for development, testing, and optimization work.

Collaborative Team Culture

Work alongside world-class ML engineers, roboticists, and systems architects in a startup environment focused on meaningful innovation.


Interview process

  1. Initial Screening Call: 30-minute technical screening with a member of the Robot Platform team to discuss your GPU systems experience, CUDA background, and understanding of the role's core challenges around GPU scheduling and model inference optimization.
  2. Technical Deep Dive Interview: 60-90 minute interview with senior systems engineers covering GPU architecture fundamentals, CUDA programming patterns, memory management strategies, and your approach to designing efficient data transfer mechanisms in latency-critical systems.
  3. Robotics Systems Context Discussion: 45-minute conversation with cross-functional stakeholders (ML engineers, SLAM/Perception team leads) to understand robotics-specific latency requirements, concurrent workload patterns, and how GPU resource scheduling impacts downstream system performance.
  4. Scheduling and Architecture Design Problem: Technical take-home or live design exercise where you architect a GPU scheduling solution for competing robotics workloads with different latency and throughput requirements, demonstrating systems thinking and tradeoff analysis.
  5. Leadership and Culture Fit Interview: 30-minute conversation with team leadership assessing collaboration style, communication of complex technical concepts, and alignment with Sunday's mission to build accessible home robotics technology.
  6. Offer and Compensation Discussion: Final discussion covering equity grants, benefits package, and long-term growth opportunities within the Robot Platform and broader Sunday engineering organization.

Apply for this position

You'll be redirected to the company's application page


Sunday

Built for busy households, Memo works 24/7 to make your life lighter. Hand off your repetitive to-do’s, so you can focus on what really matters.

sunday.ai

Tech Stack

Languages
CUDA C, C++, C, Rust, Python
Frameworks
CUDA Runtime API, CUDA Graphs, TensorRT, CUDA MPS (Multi-Process Service), CUDA MIG (Multi-Instance GPU)
Tools
NVIDIA Nsight Systems, NVIDIA Nsight Compute, GDB/CUDA-GDB, Linux perf, NVIDIA NVDEC/NVENC, Docker/Container Technology
Other
Linux Kernel Performance Tuning, GPU Memory Hierarchies, Real-Time Scheduling Theory, Hardware Accelerator Architecture