
CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)

Course Overview

  • University: Carnegie Mellon University
  • Prerequisites: No strict prerequisites. An intro ML background and hands-on deep learning training experience are recommended; familiarity with PyTorch helps, and basic CUDA/GPU knowledge will make the systems material much easier to follow
  • Programming Language: Python (systems and kernel-level topics involve CUDA/hardware concepts)
  • Course Difficulty: 🌟🌟🌟🌟
  • Estimated Study Hours: 80-120 hours

This course takes a systems-first view of modern machine learning and LLM infrastructure. The core question it keeps returning to is: how does a model written in a high-level framework (e.g., PyTorch) get decomposed into low-level kernels, and how is it executed efficiently on heterogeneous accelerators (GPUs/TPUs) and in distributed environments? The syllabus covers GPU programming, ML compilers, graph-level optimizations, distributed training and auto-parallelization, and LLM serving and inference acceleration. It is a strong fit if you want to connect “framework-level experience” with “kernels, compilation, hardware, and cluster execution.”
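To make the "high-level model to low-level kernels" idea concrete, here is a toy illustration (not course code, and far simpler than a real compiler such as TorchInductor): a framework-level expression like `y = relu(x @ W + b)` is represented as a small operator graph and "lowered" to a sequence of kernel launches, with consecutive elementwise ops fused into one kernel. All names here are made up for illustration.

```python
# Toy lowering pass: turn a list of high-level ops into kernel launches,
# fusing consecutive elementwise ops (a classic graph-level optimization).
from dataclasses import dataclass

@dataclass
class Op:
    name: str          # high-level operator ("matmul", "add", "relu")
    elementwise: bool  # elementwise ops are fusion candidates

def lower(graph):
    """Lower a list of Ops to kernel names, fusing runs of elementwise ops."""
    kernels, pending = [], []
    for op in graph:
        if op.elementwise:
            pending.append(op.name)                 # grow the fusion group
        else:
            if pending:                             # flush any pending group
                kernels.append("fused_" + "_".join(pending))
                pending = []
            kernels.append(op.name + "_kernel")     # non-fusible op: own kernel
    if pending:
        kernels.append("fused_" + "_".join(pending))
    return kernels

# y = relu(x @ W + b): one matmul kernel, then add+relu fused into one kernel.
graph = [Op("matmul", False), Op("add", True), Op("relu", True)]
print(lower(graph))  # ['matmul_kernel', 'fused_add_relu']
```

Real compilers do this on tensor IR with cost models and memory planning, but the payoff is the same: fewer kernel launches and fewer round trips through GPU memory.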

The workload is organized around consistent pre-lecture reading assignments (paper reviews) and a team-based final course project (proposal, presentation, report). For self-study, it is best to follow the schedule week by week rather than treating it as a slide-only course.

Topics Covered

The course is structured as lectures, with major themes including:

  1. ML systems fundamentals via TensorFlow/PyTorch (abstractions, execution models)
  2. GPU architecture and CUDA programming (memory, performance tuning)
  3. Transformer and attention case studies (FlashAttention and IO-aware attention)
  4. Advanced CUDA techniques (warp specialization, mega kernels)
  5. ML compilation (tile-based DSLs like Triton, kernel auto-tuning, graph-level optimizations, superoptimization such as Mirage)
  6. Parallelization and distributed training (ZeRO/FSDP, model/pipeline parallelism, auto-parallelization such as Alpa)
  7. LLM serving and inference (batching, PagedAttention, RadixAttention, speculative decoding)
  8. Post-training and architectures (PEFT like LoRA/QLoRA, MoE architectures/kernels/parallelism)

Course Resources