AI System Research and Development Engineer - Optimization

Snowflake (2 weeks ago)

Location

US-WA-Bellevue

Type

Full Time

Salary

USD 200,000 – 265,000

Level

Senior

Role

ML Engineer

Posted

Jun 16, 2026

The role

Summary

Join Snowflake's AI Research team as an AI System Research and Development Engineer focused on optimization, where you'll develop and optimize GPU kernel performance for LLM training and inference systems. This role involves designing high-performance computing solutions, implementing system optimizations to reduce latency and improve resource utilization, and contributing to cutting-edge agentic frameworks. You'll collaborate with world-class engineers including founding members from DeepSpeed, vLLM, and TensorFlow to build the most efficient and scalable generative AI systems.

What you'll do

GPU Kernel Performance Optimization: Analyze, profile, and optimize GPU kernel performance for both training and inference of large language models using advanced profiling tools such as nvprof and NVIDIA Nsight. Identify performance bottlenecks and implement targeted optimizations to maximize throughput and minimize computational latency.

Deep Learning System Architecture: Design and implement comprehensive strategies to enhance the efficiency and scalability of deep learning systems across distributed computing environments. Develop solutions for model parallelism, data parallelism, and pipeline parallelism to support large-scale training.

Performance Benchmarking and Analysis: Conduct detailed profiling and benchmarking of deep learning systems using industry-standard tools and methodologies. Create performance analysis reports that quantify improvements and establish baseline metrics for ongoing optimization efforts.

Latency and Resource Optimization: Design and implement targeted optimizations to reduce inference latency and training time while improving resource utilization across GPU clusters. Focus on memory efficiency, bandwidth optimization, and compute density improvements.

Agentic Framework Development: Contribute to the development of agentic frameworks and applications for LLM-driven workflows, building systems that enhance automation, reasoning, and decision-making capabilities. Implement autonomous agent architectures that leverage advanced LLM capabilities.

Research and Innovation Publication: Stay current with the latest advancements in GPU kernel optimization, deep learning systems, and LLM development. Release innovations as open source and publish research findings in technical blogs, top-tier conferences, and peer-reviewed journals to establish thought leadership.

Cross-Functional Collaboration: Work effectively with research scientists, infrastructure engineers, and product teams in a fast-paced environment. Provide technical guidance and contribute to architectural decisions for next-generation AI systems.

What we look for

Technical

CUDA and GPU Architecture Expertise: Expert-level understanding of CUDA programming, GPU memory hierarchies, and modern GPU architectures (NVIDIA architecture families). Experience with custom kernel development and optimization.

Deep Learning Frameworks: Production proficiency with PyTorch, TensorFlow, JAX, or similar frameworks. Deep understanding of computational graph optimization, automatic differentiation, and framework internals.

Specialized Optimization Libraries: Hands-on experience with CUTLASS, Triton, cuDNN, and similar specialized libraries for accelerating deep learning operations. Understanding of kernel fusion, operator optimization, and mixed-precision training.

Performance Analysis and Profiling: Expert proficiency with nvprof, NVIDIA Nsight, PyTorch Profiler, and other performance analysis tools. Ability to interpret performance metrics, identify bottlenecks, and correlate profiling data to system behavior.

Problem-Solving and Debugging: Advanced ability to debug complex performance issues, analyze system behavior under load, and design targeted solutions. Strong understanding of numerical stability, precision trade-offs, and optimization constraints.

Education

Bachelor's Degree in Computer Science or Electrical Engineering: Required foundation in computer science, electrical engineering, or a closely related field, with strong fundamentals in algorithms, data structures, and systems design.

Master's or PhD Degree (Preferred): An advanced degree in machine learning, computer architecture, parallel computing, or a related field is highly preferred and demonstrates deep expertise in system optimization.

Experience

5+ Years GPU Kernel Optimization Experience: Significant hands-on experience in GPU kernel optimization, deep learning system optimization, or high-performance computing (HPC). Track record of shipping optimizations that delivered measurable performance improvements.

Deep Learning Framework Production Experience: Demonstrated experience developing and optimizing deep learning workflows using production-grade frameworks in real-world applications at scale.

High-Performance Computing Systems: Substantial experience with distributed computing, multi-GPU training, and large-scale deep learning system deployment in production environments.

Skills

Required skills

CUDA Programming: Expert-level CUDA development with proven ability to write, optimize, and debug GPU kernels. Understanding of memory coalescing, occupancy optimization, and synchronization primitives.

GPU Architecture Knowledge: Deep understanding of modern GPU architectures including memory hierarchies, warp execution, shared memory optimization, and tensor core utilization.

PyTorch or TensorFlow: Production-grade proficiency with PyTorch, TensorFlow, or JAX. Ability to extend frameworks with custom operators and optimize computational graphs for specific hardware targets.

Performance Profiling: Expert use of profiling tools including nvprof, NVIDIA Nsight, PyTorch Profiler. Ability to identify performance bottlenecks and correlate metrics to optimization opportunities.

Deep Learning System Optimization: Comprehensive understanding of training and inference optimization techniques including mixed-precision training, quantization, sparsity, and model compression.

Large Language Model Systems: Experience working with large language model architectures, transformer optimization, attention mechanism acceleration, and LLM inference serving.

Nice to have

Triton or CUTLASS: Familiarity with modern kernel programming frameworks like Triton or CUTLASS that enable rapid kernel development with automatic optimization.

Distributed Training: Experience with distributed training frameworks including data parallelism, model parallelism, and pipeline parallelism across multi-node GPU clusters.

Agentic AI Systems: Understanding of agentic frameworks, autonomous agent architectures, and LLM-driven workflow automation systems.

Open Source Contributions: Active contributions to open-source deep learning projects such as vLLM, DeepSpeed, PyTorch, or similar frameworks, demonstrating community engagement and technical depth.

Technical Writing and Communication: Experience publishing research papers, technical blog posts, or conference presentations, demonstrating the ability to communicate complex optimization techniques to technical audiences.

Machine Learning Infrastructure: Knowledge of ML infrastructure, MLOps, monitoring, and production deployment of machine learning systems.

Compensation & benefits

Salary

USD 200,000 – 265,000 (annual)

Stock options

Available

Benefits

Equity and Stock Options

Competitive stock option grants allowing you to participate in Snowflake's growth as a publicly traded company in the high-growth cloud computing sector.

Comprehensive Health Insurance

Medical, dental, and vision coverage with company subsidies for employee, spouse, and family plans.

401(k) Retirement Plan

Employer-matched 401(k) retirement savings plan with competitive matching percentages.

Unlimited Paid Time Off

Flexible vacation policy with unlimited paid time off to support work-life balance for engineering teams.

Professional Development

Learning and development budget for conferences, courses, and training to stay current with emerging AI and GPU optimization techniques.

Wellness and Fitness Programs

Gym memberships, wellness initiatives, and mental health resources supporting overall employee well-being.

Parental Leave

Generous parental leave policies supporting work-life balance during major life transitions.

Remote Work Flexibility

Flexible work arrangements supporting distributed team collaboration and remote work options.

Interview process

1. Initial Screening Call: Conversation with recruiter covering background, career goals, and alignment with the AI Research team's mission. Discussion of relevant GPU optimization experience and technical interests.
2. Technical Phone Screen: Technical interview with a senior engineer assessing GPU architecture knowledge, CUDA proficiency, and understanding of deep learning system optimization. Discussion of specific optimization challenges and problem-solving approaches.
3. Deep Dive Technical Interview: Comprehensive technical assessment with multiple engineers covering GPU kernel optimization techniques, CUTLASS/Triton experience, performance profiling methodologies, and analysis of real optimization problems from Snowflake's codebase.
4. System Design and Research Discussion: Conversation focused on large-scale system design, optimization strategy for LLM training and inference, and familiarity with recent research in GPU acceleration. Discussion of approaches to multi-GPU scaling and optimization trade-offs.
5. Manager and Team Fit Interview: Discussion with direct manager and potentially team members about collaboration style, research interests, and vision for AI systems development. Evaluation of cross-functional communication and work in dynamic environments.
6. Executive or Lead Engineer Conversation: Optional final-stage conversation with research leadership to discuss career aspirations, long-term goals in AI research, and Snowflake's strategic direction in generative AI systems.