Gang Scheduling for Llama

The rapid advancement of AI has necessitated a fundamental shift in infrastructure: from homogeneous workloads that fit within a single server to multi-host workloads requiring tight container coordination across multiple servers. This talk explores the motivations and design principles behind this shift, focusing on first-class support for gang scheduling at every layer of the system. We delve into the key components of this design, including Twine and the Resource Allowance System (RAS), and examine how they enable AI serving schemes that combine multiple forms of parallelism (pipeline, context, tensor, and expert parallelism), which require shared-fate semantics across containers and network-topology-aware allocation. By addressing these challenges, we aim to provide insights into building scalable and reliable systems that meet the demands of modern AI workloads.
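The talk covers Meta-internal systems, so the sketch below is only an illustration of the core gang-scheduling idea rather than Twine's or RAS's actual interfaces: a group of containers is placed all-or-nothing, and topology awareness is modeled (as an assumption here) by requiring the whole gang to land in a single network domain so that parallelism traffic stays on the local fabric.

```python
"""Minimal gang-scheduling sketch. All names (Host, Gang, schedule_gang)
are hypothetical and do not reflect Twine's or RAS's real APIs."""

from dataclasses import dataclass


@dataclass
class Host:
    name: str
    network_domain: str   # e.g. a rack or backend-network zone
    free: bool = True


@dataclass
class Gang:
    """A set of containers that must be scheduled together (shared fate)."""
    job_id: str
    size: int             # number of containers in the gang


def schedule_gang(gang: Gang, hosts: list[Host]) -> list[Host] | None:
    """All-or-nothing, topology-aware placement.

    Returns the hosts assigned to the gang, or None if no single
    network domain can hold the whole gang. Capacity is committed
    only on success, so a partially placed gang never holds hosts.
    """
    by_domain: dict[str, list[Host]] = {}
    for host in hosts:
        if host.free:
            by_domain.setdefault(host.network_domain, []).append(host)

    # Topology awareness: the entire gang must fit in one domain so
    # that tensor/pipeline traffic stays on the fast local fabric.
    for domain_hosts in by_domain.values():
        if len(domain_hosts) >= gang.size:
            chosen = domain_hosts[: gang.size]
            for host in chosen:
                host.free = False   # commit atomically, only on success
            return chosen
    return None   # gang stays queued; no partial allocation


if __name__ == "__main__":
    hosts = [Host(f"h{i}", network_domain=f"rack{i % 2}") for i in range(8)]
    placement = schedule_gang(Gang("llama-serving", size=4), hosts)
    print([h.name for h in placement] if placement else "queued")
```

Shared fate cuts the other way too: under this model, if any one container in the gang fails, the whole gang would be torn down and rescheduled as a unit, since a partially alive serving group cannot make progress.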
