Stuart Sul

ssul (at) cs.stanford.edu

Hi 👋

I’m a master’s student in Computer Science at Stanford, advised by the amazing Chris Ré in the Hazy Research lab and supported by the Kwanjeong Scholarship. I am also a part-time Research Scientist at Cursor.

My research focuses on ML systems. I design hardware-aware abstractions and methods to accelerate training and inference of large language models. Some of my work includes megakernels, efficient MXFP8 training, GPU networking, and ThunderKittens.

Prior to Stanford and Cursor, I co-founded Blux, an AI startup bringing AI-driven personalization to Korean e-commerce. Blux has raised $3M+ and personalizes over 10 million Korean users’ online journeys monthly.

In addition, I am an electric guitarist and composer. I have released two albums and performed as lead guitarist and producer for multiple rock bands since 2008.


Experience

Research Scientist @ Cursor

June 2025 - Present

Building in-house models and optimizing AI kernels for large-scale training and inference. Check out my recent blog post on MXFP8 MoE kernels.

Research Assistant @ Stanford AI Lab

December 2024 - Present

Advised by Prof. Chris Ré at Hazy Research. Working on ThunderKittens, low-latency and high-throughput megakernels, and GPU networking.

Co-Founder and CTO @ Blux

July 2021 - August 2023

Blux provides real-time recommender systems for e-commerce. The company raised $3M+ and personalizes over 10 million Korean users’ online journeys monthly.

I was the only technical co-founder until we acquired our first paying customer, building the server, infrastructure, and ML stack from scratch. Over time, I recruited and led a team of 15+ engineers.

Research Assistant @ Seoul National University

June 2020 - September 2020

Worked with Prof. Jae W. Lee at the Architecture and Code Optimization Lab to design and implement a novel embedding clustering algorithm in C++, reducing main memory access by up to 44% in commercial deep learning recommendation models (DLRMs). This research resulted in a paper accepted at ASPLOS 2021.

Research Assistant @ Seoul National University

December 2019 - February 2020

Worked with Prof. Kyogu Lee at the Music and Audio Research Group on audio processing architectures using CNNs, cGANs, super-resolution, and the Griffin-Lim algorithm for commercial singing voice synthesis. Also performed millisecond-precision data labeling on audio and MIDI data using Logic Pro and Python for 30+ K-pop songs.

Sergeant, Combat Medic @ US Army

November 2017 - August 2019

Served as a Combat Medic (68W) in a U.S. Army cavalry unit for 21 months through the Korean Augmentation to the United States Army (KATUSA) program, fulfilling South Korea’s mandatory military service requirement.


Projects

Megakernels

Runs FlashMLA, Llama 3 1B, and Llama 3 70B each as a single fused megakernel. 500+ stars on GitHub.

ThunderKittens

Helps you write speedy GPU kernels for AI. 2.7k+ stars on GitHub and adopted by Cursor, Together AI, Jump Trading, Modular, TileLang, and Nvidia CuTe 4.0.

MERCI

Fast embedding reduction algorithm for deep learning recommendation models (DLRMs) and other systems with very large embedding tables.

ELF32 Dynamic Linker for Raspberry Pi

ELF dynamic linker on bare metal, allowing you to port shared libraries.

Taught as a lab at Stanford (CS 240LX: ELF and Dynamic Linker).

Co-Chuck

Collaborative online environment that allows you to “code” your music.

SampyoNet

Deep learning model, inference engine, and mobile app combined for gravel quality assessment. Provided to Sampyo for production concrete manufacturing.

LLVM Compiler Optimization

Achieved 2nd place among 13 teams in an LLVM optimization competition at Seoul National University.


Writing

We Bought the Whole GPU, So We’re Damn Well Going to Use the Whole GPU

Hazy Research

How Many Llamas Can Dance in the Span of a Kernel?

Hazy Research

One Kernel for All Your GPUs

Hazy Research

1.5x Faster MoE Training with Custom MXFP8 Kernels Built From Scratch

Cursor

Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B

Hazy Research


Education

Stanford University

M.S. in Computer Science

  • Recipient of the merit-based Kwanjeong Scholarship ($60,000)

Seoul National University

B.S. in Computer Science and Engineering

  • GPA: 4.21/4.3 (class rank: 1st)
  • Recipient of the National Science and Engineering Scholarship for Gifted Students (full scholarship)
  • Member of the College of Engineering Honor Society