LLM Serving Optimizations

This chapter covers an array of optimization techniques that are frequently used when serving LLMs. Some of these techniques can be applied to other types of models, and some can also be applied to training (FlashAttention, for instance). We subdivide the techniques into two categories: Quality Neutral and Quality Detrimental. Quality Neutral techniques improve latency or throughput without degrading the quality of the output, while Quality Detrimental techniques do affect output quality negatively: they offer a tradeoff between latency and quality.