Snowflake

AI System Research and Development Engineer - Optimization

Location: US-WA-Bellevue
Type: Full Time
Salary: USD 200,000 – 287,500
Level: Senior
Role: ML Engineer
Posted: Apr 30, 2026


The role

Summary

Join Snowflake's AI Research team as an AI System Research and Development Engineer focused on optimization. The role centers on developing and optimizing GPU kernels, deep learning systems, and LLM inference/training infrastructure, in collaboration with world-class researchers including founding members of DeepSpeed and vLLM. You'll contribute to cutting-edge AI system optimizations such as SwiftKV and Arctic LLM; strong expertise in CUDA, PyTorch/TensorFlow, and GPU architecture is required.

What you'll do

GPU Kernel Analysis and Optimization: Analyze and optimize GPU kernels for both training and inference workloads of large language models, focusing on memory access patterns, compute efficiency, and latency reduction. Profile kernel performance using advanced tools and implement targeted optimizations to achieve measurable improvements.
Deep Learning System Efficiency: Develop and implement comprehensive strategies to enhance the efficiency and scalability of deep learning systems, including techniques for better resource utilization, improved throughput, and reduced training/inference costs at scale.
Performance Profiling and Benchmarking: Profile and benchmark deep learning systems using industry-standard profiling tools and rigorous performance analysis methodologies. Identify bottlenecks in GPU utilization, memory bandwidth, and computational patterns to inform optimization strategies.
Latency and Resource Optimization: Design and implement targeted optimizations to reduce inference latency, improve training convergence speed, and maximize GPU resource utilization. Collaborate with cross-functional teams to deploy optimizations in production systems.
Research and Development Leadership: Stay current with the latest advancements in GPU kernel optimization techniques, deep learning system architectures, and LLM system development. Evaluate emerging technologies and methodologies for potential integration into Snowflake's AI infrastructure.
Agentic Framework Development: Contribute to the design and development of agentic frameworks and applications for LLM-driven workflows, focusing on enhancing automation capabilities, reasoning performance, and decision-making accuracy across enterprise use cases.
Open-Source Publication and Thought Leadership: Document innovations, optimizations, and engineering practices for public release. Publish research findings and technical insights in open-source repositories, industry blogs, and top-tier conferences and journals to advance the broader AI systems community.
Cross-Functional Collaboration: Work effectively with founding members of leading AI infrastructure projects like DeepSpeed and vLLM. Collaborate with research scientists, infrastructure engineers, and product teams to translate optimizations into customer-facing value.
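The profiling and latency work described above often starts with a roofline check: comparing a kernel's arithmetic intensity (FLOPs per byte of DRAM traffic) against the hardware's compute-to-bandwidth ratio to decide whether it is memory-bound or compute-bound. A minimal sketch of that reasoning in Python; the peak figures used below are illustrative H100-class assumptions, not vendor specifications:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

def roofline_bound(flops: float, bytes_moved: float,
                   peak_tflops: float, peak_bw_tbs: float) -> str:
    """Classify a kernel under a simple roofline model.
    peak_tflops: peak compute (TFLOP/s); peak_bw_tbs: peak DRAM bandwidth (TB/s)."""
    ai = arithmetic_intensity(flops, bytes_moved)
    ridge = peak_tflops / peak_bw_tbs  # FLOPs/byte at the roofline knee
    return "compute-bound" if ai >= ridge else "memory-bound"

# Example: fp16 GEMM with M = N = K = 4096 (2*M*N*K FLOPs; an idealized
# traffic model that reads A and B and writes C exactly once, 2 bytes/elem).
M = N = K = 4096
gemm_flops = 2 * M * N * K
gemm_bytes = 2 * (M * K + K * N + M * N)
print(roofline_bound(gemm_flops, gemm_bytes, peak_tflops=989, peak_bw_tbs=3.35))
# -> "compute-bound" (AI ~1365 FLOPs/byte, far above the ~295 ridge point)

# Contrast: an elementwise fp32 add (1 FLOP per 12 bytes) is memory-bound.
print(roofline_bound(4096, 12 * 4096, peak_tflops=989, peak_bw_tbs=3.35))
```

In practice the traffic term comes from profiler counters (e.g. Nsight Compute's DRAM throughput metrics) rather than an analytical model, but the classification step is the same.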

What we look for

Technical

GPU Kernel Optimization: Demonstrated expertise in analyzing and optimizing GPU kernels for training and inference, with measurable improvements in throughput or latency.
CUDA Framework Mastery: Advanced proficiency with CUDA programming, including memory optimization, synchronization patterns, and performance tuning techniques.
Deep Learning Framework Expertise: Production-grade experience with PyTorch, TensorFlow, or JAX, including implementation of custom operations and performance optimization.
Performance Profiling: Hands-on ability to use nvprof, NVIDIA Nsight, and other profiling tools to identify performance bottlenecks and validate optimizations.
GPU Architecture Knowledge: Deep understanding of modern GPU architectures, including memory models, compute capabilities, and hardware-software co-design principles.
HPC System Optimization: Experience optimizing systems for high-performance computing, including multi-GPU scaling and distributed computing patterns.
Kernel Optimization Libraries: Practical experience with CUTLASS, Triton, cuDNN, or similar libraries for implementing and optimizing deep learning operations.

Education

Bachelor's Degree: Bachelor's degree in Computer Science, Electrical Engineering, Computer Engineering, or a closely related field required.
Advanced Degree Preferred: Master's degree or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related discipline strongly preferred for advanced research roles.

Experience

5+ Years GPU Optimization Experience: Minimum of 5 years of professional experience in GPU kernel optimization, deep learning system optimization, or high-performance computing.
Deep Learning System Optimization: Proven track record of optimizing deep learning systems for production workloads, with demonstrated impact on performance metrics.
LLM or Agentic Systems Development: Experience developing, optimizing, or contributing to large language model systems, inference engines, or agentic AI frameworks.
Open-Source Contributions: Active contributions to open-source deep learning or AI infrastructure projects, demonstrating systems thinking and collaboration skills.

Skills

Required skills

GPU Kernel Optimization: Expert-level proficiency in analyzing and optimizing GPU kernels for training and inference workloads, with a deep understanding of memory hierarchy, occupancy, and instruction-level parallelism.
CUDA Programming: Advanced CUDA expertise for GPU computing, including kernel development, memory optimization, and performance tuning for large-scale deep learning systems.
Deep Learning Frameworks: Production-level proficiency with PyTorch, TensorFlow, or JAX for implementing and optimizing deep learning models at scale.
LLM System Optimization: Experience optimizing large language model inference and training systems for efficiency, latency reduction, and resource utilization.
GPU Architecture Understanding: Deep knowledge of modern GPU architectures (NVIDIA H100, A100, etc.), memory models, compute capabilities, and hardware-software co-design principles.
Performance Profiling and Benchmarking: Hands-on experience with profiling tools such as nvprof and NVIDIA Nsight, and with methodologies for identifying and analyzing performance bottlenecks in complex systems.
High-Performance Computing (HPC): Solid background in HPC principles, distributed computing, and optimization techniques for scaling workloads across multiple GPUs and nodes.
Problem-Solving and Debugging: Strong analytical capabilities, with the ability to systematically debug complex performance issues and implement effective solutions in production systems.
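The occupancy item above refers to how many warps an SM can keep resident, which is the tightest of several per-SM resource limits. A simplified sketch of that calculation in Python; the limits below are illustrative A100/H100-class assumptions, and real figures come from the CUDA occupancy APIs (e.g. `cudaOccupancyMaxActiveBlocksPerMultiprocessor`), which also account for allocation granularity:

```python
# Illustrative per-SM limits (assumed, roughly A100/H100-class).
MAX_THREADS_PER_SM = 2048
MAX_REGS_PER_SM = 65536
MAX_SMEM_PER_SM = 102400  # bytes
MAX_BLOCKS_PER_SM = 32

def blocks_per_sm(threads_per_block: int, regs_per_thread: int,
                  smem_per_block: int) -> int:
    """Resident blocks per SM: the tightest of the thread, register,
    shared-memory, and block-count limits (granularity effects ignored)."""
    limits = [
        MAX_THREADS_PER_SM // threads_per_block,
        MAX_REGS_PER_SM // (regs_per_thread * threads_per_block),
        MAX_SMEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM,
        MAX_BLOCKS_PER_SM,
    ]
    return min(limits)

def occupancy(threads_per_block: int, regs_per_thread: int,
              smem_per_block: int = 0) -> float:
    """Fraction of the SM's maximum resident threads actually achievable."""
    blocks = blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block)
    return blocks * threads_per_block / MAX_THREADS_PER_SM

# 256-thread blocks at 64 registers/thread are register-limited:
# 65536 // (64 * 256) = 4 blocks -> 1024 threads -> 50% occupancy.
print(occupancy(256, 64))  # 0.5
```

This is the kind of back-of-envelope check that motivates trading registers (via `__launch_bounds__` or `maxrregcount`) against spilling when a kernel is occupancy-limited.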

Nice to have

CUTLASS and Triton: Hands-on experience with CUTLASS or Triton for implementing portable, high-performance GPU kernels.
cuDNN Optimization: Experience optimizing deep learning operations with cuDNN and familiarity with best practices for leveraging optimized neural network primitives.
Open-Source Contributions: Track record of contributing to or maintaining open-source deep learning frameworks and optimization libraries such as vLLM, DeepSpeed, or TensorRT.
Agentic Systems Development: Experience building or optimizing agentic frameworks and applications for LLM-driven workflows, with a focus on reasoning, automation, and decision-making.
Distributed Training Systems: Familiarity with distributed training frameworks, collective communication libraries (NCCL), and techniques for scaling training across multiple GPUs.
Inference Optimization Technologies: Knowledge of inference optimization techniques including quantization, pruning, knowledge distillation, and dynamic batching strategies.
Technical Publication Experience: Experience publishing research papers or technical blog posts in top-tier conferences and journals, demonstrating the ability to communicate complex optimization techniques.
C++ Systems Programming: Strong C++ skills for systems-level optimization work, especially in performance-critical components of deep learning infrastructure.
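As a concrete instance of one item above, symmetric per-tensor int8 quantization maps floats onto 8-bit integers via a single shared scale. A minimal pure-Python sketch of the idea; production systems typically use per-channel scales and calibrated activation ranges instead:

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: scale by max|x| / 127,
    round, and clamp to the signed 8-bit range."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from int8 codes and the shared scale."""
    return [x * scale for x in q]

weights = [0.02, -1.5, 0.7, 3.0, -0.4]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
# Per-element round-trip error is bounded by ~scale/2.
```

The payoff in LLM inference is halved (vs. fp16) weight traffic for memory-bound kernels, at the cost of the rounding error bounded above.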

Compensation & benefits

Salary: USD 200,000 – 287,500 (annual)

Stock options: Available

Benefits

Equity Compensation

Competitive stock options as part of total compensation package, allowing you to participate in Snowflake's growth as a public company.

Health and Wellness Coverage

Comprehensive health insurance including medical, dental, and vision coverage for employees and dependents.

Retirement Planning

401(k) retirement savings plan with company matching to support long-term financial security.

Professional Development

Access to learning resources, conference attendance budgets, and opportunities to publish research at top-tier venues.

Collaborative Research Environment

Work alongside world-class AI researchers and engineers, including founding members of DeepSpeed, vLLM, and TensorFlow projects.

Flexible Work Arrangement

Flexible work environment supporting remote and hybrid arrangements for qualified candidates.

Innovation Culture

Opportunity to work on cutting-edge AI systems and contribute to open-source projects with direct impact on the AI community.


Interview process

1. Initial Screening: Recruiter phone screen to discuss your background, experience with GPU optimization, and alignment with the role's technical requirements and research-focused mission.
2. Technical Deep Dive: Technical interview with AI Research team members covering GPU architecture knowledge, CUDA programming experience, and performance optimization strategies. Expect detailed discussions about past optimization projects and approaches.
3. System Design and Problem Solving: Interview focused on system-level optimization challenges, including designing solutions for improving LLM inference efficiency, analyzing performance trade-offs, and proposing optimization strategies for complex deep learning systems.
4. Research and Innovation Discussion: Conversation with senior researchers about your research interests, published work, open-source contributions, and vision for advancing AI systems optimization, including potential research directions within Snowflake's AI infrastructure.
5. Cross-Functional Collaboration: Meeting with engineering leadership and cross-functional stakeholders to assess collaboration style, communication skills, and ability to drive optimization efforts across teams.
6. Leadership Discussion: Final conversation with the hiring manager or a director-level stakeholder covering career goals, research vision, and how this role aligns with your long-term objectives in AI systems optimization.


Snowflake


Snowflake is an American cloud computing company offering data warehousing and analytics platforms.

Bozeman, Montana, United States · Founded 2012 · snowflake.com

Tech Stack

Languages: Python, CUDA C/C++, C++
Frameworks: PyTorch, TensorFlow, JAX, vLLM, DeepSpeed
Tools: NVIDIA Nsight, nvprof, CUTLASS, Triton, cuDNN, NCCL
Other: GPU Architecture Knowledge, Distributed Systems, Performance Analysis Methodologies
