1. Introduction
2. Array Programming Fundamentals
   2.1. What is an Array?
   2.2. Basic Operators
   2.3. Broadcasting
   2.4. Slicing
   2.5. Indexing
   2.6. Reshaping And Transposing
   2.7. Einsums
   2.8. Practice: Implementing An LLM's Forward Pass
3. ML Compilers
   3.1. JAX vs PyTorch
   3.2. Eager Mode
   3.3. Optimizations
4. Backward Pass
5. On-Chip Parallelism
6. Estimating Performance
   6.1. How to Compute It?
   6.2. Practical Example
   6.3. Roofline Model
   6.4. Practice Questions
7. Distributed Computations
   7.1. Distributed Ops
      7.1.1. All-Gather
      7.1.2. All-Reduce and Reduce-Scatter
      7.1.3. All-To-All
   7.2. Sharding Strategies
      7.2.1. Data Parallelism
      7.2.2. Pipelining
      7.2.3. Fully Sharded Data Parallel (FSDP)
      7.2.4. Tensor Parallelism
      7.2.5. Practice Questions
8. LLM Serving Optimizations
   8.1. Quality Neutral
      8.1.1. KV Caching
      8.1.2. Disaggregated Serving
      8.1.3. Speculative Decoding
      8.1.4. Flash Attention
   8.2. Quality Detrimental
      8.2.1. Quantization
9. Mixture of Experts (MoE)
   9.1. Expert Sharding
   9.2. Expert Imbalance (TODO)
10. Credits
ML Performance
Distributed Operations
Let's first review the three most important distributed operations.
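Before looking at each collective in detail, it can help to see their semantics in isolation. The sketch below simulates the collectives covered in this chapter (all-gather, all-reduce, reduce-scatter, and all-to-all) with plain NumPy over a list of per-device shards. This is only a single-process model of what each operation computes; real implementations move these shards over an interconnect, and the function names here are illustrative, not a real API.

```python
import numpy as np

def all_gather(shards):
    # Every device ends up with the concatenation of all shards.
    full = np.concatenate(shards)
    return [full.copy() for _ in shards]

def all_reduce(shards):
    # Every device ends up with the elementwise sum over all shards.
    total = sum(shards)
    return [total.copy() for _ in shards]

def reduce_scatter(shards):
    # Sum elementwise first, then each device keeps only its slice of the result.
    total = sum(shards)
    return list(np.split(total, len(shards)))

def all_to_all(shards):
    # Device i sends its j-th chunk to device j: a distributed transpose.
    n = len(shards)
    chunks = [np.split(s, n) for s in shards]
    return [np.concatenate([chunks[src][dst] for src in range(n)])
            for dst in range(n)]
```

For example, with two simulated devices holding `[1, 2]` and `[3, 4]`, `all_gather` leaves both with `[1, 2, 3, 4]`, `all_reduce` leaves both with `[4, 6]`, and `reduce_scatter` leaves device 0 with `[4]` and device 1 with `[6]` — which also shows why an all-reduce can be decomposed into a reduce-scatter followed by an all-gather.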