Gang Scheduling for Llama

The rapid advancement of AI has necessitated a fundamental shift in infrastructure: from homogeneous workloads that fit within a single server to multi-host workloads requiring tight container coordination across multiple servers. This talk explores the motivations and design principles behind this shift, focusing on first-class support for gang scheduling at every layer of the system. We delve into the key components of this design, including Twine and the Resource Allowance System (RAS), and examine how they enable AI serving schemes that combine multiple forms of parallelism (pipeline, context, tensor, and expert parallelism), which require shared-fate semantics across containers and network-topology-aware allocation. By addressing these challenges, we aim to provide insights into building scalable and reliable systems that meet the demands of modern AI workloads.
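The talk covers Meta-internal systems, so the sketch below is only an illustration of the core gang-scheduling idea rather than Twine's or RAS's actual interfaces: a group of containers is placed all-or-nothing, and topology awareness is modeled (as an assumption here) by requiring the whole gang to land in a single network domain so that parallelism traffic stays on the local fabric.

```python
"""Minimal gang-scheduling sketch. All names (Host, Gang, schedule_gang)
are hypothetical and do not reflect Twine's or RAS's real APIs."""

from dataclasses import dataclass


@dataclass
class Host:
    name: str
    network_domain: str   # e.g. a rack or backend-network zone
    free: bool = True


@dataclass
class Gang:
    """A set of containers that must be scheduled together (shared fate)."""
    job_id: str
    size: int             # number of containers in the gang


def schedule_gang(gang: Gang, hosts: list[Host]) -> list[Host] | None:
    """All-or-nothing, topology-aware placement.

    Returns the hosts assigned to the gang, or None if no single
    network domain can hold the whole gang. Capacity is committed
    only on success, so a partially placed gang never holds hosts.
    """
    by_domain: dict[str, list[Host]] = {}
    for host in hosts:
        if host.free:
            by_domain.setdefault(host.network_domain, []).append(host)

    # Topology awareness: the entire gang must fit in one domain so
    # that tensor/pipeline traffic stays on the fast local fabric.
    for domain_hosts in by_domain.values():
        if len(domain_hosts) >= gang.size:
            chosen = domain_hosts[: gang.size]
            for host in chosen:
                host.free = False   # commit atomically, only on success
            return chosen
    return None   # gang stays queued; no partial allocation


if __name__ == "__main__":
    hosts = [Host(f"h{i}", network_domain=f"rack{i % 2}") for i in range(8)]
    placement = schedule_gang(Gang("llama-serving", size=4), hosts)
    print([h.name for h in placement] if placement else "queued")
```

Shared fate cuts the other way too: under this model, if any one container in the gang fails, the whole gang would be torn down and rescheduled as a unit, since a partially alive serving group cannot make progress.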
