
Stuart Sul

ssul (at) cs.stanford.edu

Hi 👋

I’m a PhD student in Computer Science at Stanford, advised by the amazing Chris Ré as part of Hazy Research and the Stanford AI Lab. I’m also an ML researcher at Cursor.

My research focuses on ML systems. I design hardware-aware algorithms and abstractions to accelerate AI model training and inference. Some of my work includes megakernels, efficient MXFP8/NVFP4 training, GPU networking, and ThunderKittens.

Prior to Stanford and Cursor, I co-founded Blux, a B2B AI startup that powers Korean e-commerce with AI-driven personalization. Blux has raised over $3M and personalizes the online journeys of more than 10 million Korean users every month.

In addition, I am an electric guitarist and composer. I have released two albums and performed as lead guitarist and producer for multiple rock bands since 2008.


Experience

Research Scientist @ Cursor

June 2025 - Present

Building Composer and optimizing AI kernels for large-scale training and inference. Check out my blog post on MXFP8 MoE kernels.

Research Assistant @ Stanford AI Lab

September 2024 - Present

Advised by Prof. Chris Ré at Hazy Research. Working on ThunderKittens, low-latency and high-throughput megakernels, and GPU networking.

Co-Founder and CTO @ Blux

July 2021 - August 2023

Blux provides real-time recommender systems for e-commerce. The company has raised over $3M and personalizes the online journeys of more than 10 million Korean users every month.

I was the only technical co-founder until we landed our first paying customer, building everything (servers, infrastructure, ML) from scratch. Over time, I recruited and led a team of 15+ engineers.

Sergeant @ US Army

November 2017 - August 2019

Served as a Combat Medic (68W) in a U.S. Army cavalry unit for 21 months through the Korean Augmentation to the United States Army (KATUSA) program.


Projects

Megakernels

Runs FlashMLA, Llama 3 1B, and Llama 3 70B in a single fused megakernel. 600+ stars on GitHub.

ThunderKittens

Helps you write speedy GPU kernels for AI. 3,300+ stars on GitHub and adopted by Cursor, Together AI, Jump Trading, Modular, TileLang, and Nvidia CuTe 4.0.

MERCI

Fast embedding reduction algorithm for deep learning recommendation models (DLRMs) and other systems with very large embedding tables.

ELF32 Dynamic Linker for Raspberry Pi

An ELF dynamic linker running on bare metal, making it possible to port shared libraries to the Raspberry Pi without an OS.

Taught as a lab at Stanford University: CS 240LX: ELF and Dynamic Linker.

Co-Chuck

Collaborative online environment that allows you to “code” your music.

SampyoNet

Deep learning model, inference engine, and mobile app combined for gravel quality assessment. Provided to Sampyo for production concrete manufacturing.

LLVM Compiler Optimization

Achieved 2nd place among 13 teams in an LLVM optimization competition at Seoul National University.


Writing

Composer 2 Technical Report

Cursor, Mar 2026

ThunderKittens 2.0: Even Faster Kernels for Your GPUs

Hazy Research, Feb 2026

Loads and Loads of Fluffy Kittens

Hazy Research, Nov 2025

ParallelKittens: Simple and Fast Multi-GPU AI Kernels

Hazy Research, Nov 2025

We Bought the Whole GPU, So We’re Damn Well Going to Use the Whole GPU

Hazy Research, Sep 2025

How Many Llamas Can Dance in the Span of a Kernel?

Hazy Research, Sep 2025

One Kernel for All Your GPUs

Hazy Research, Sep 2025

1.5x Faster MoE Training with Custom MXFP8 Kernels Built From Scratch

Cursor, Aug 2025

Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B

Hazy Research, May 2025


Education

Stanford University

PhD in Computer Science

  • Advisor: Chris Ré

Stanford University

MS in Computer Science

  • Advisor: Chris Ré
  • Funded by Kwanjeong scholarship

Seoul National University

BS in Computer Science and Engineering

  • GPA: 4.21/4.3 (class rank: 1st)
  • Recipient of the National Science and Engineering Scholarship for Gifted Students (full scholarship)
  • Member of the College of Engineering Honor Society