News & Ideas | At Scale Conferences

Data, Systems and Networking

Product Reliability in Google Maps

While our organization excelled at maintaining server SLOs for Google Maps, we discovered that many user-impacting incidents, particularly those stemming from client-side issues like mobile app rollouts, remained undetected by server-centric monitoring. This realization prompted a strategic shift towards product reliability, prioritizing the end-user experience. This talk will discuss how we navigated this transition, sharing […]

Journey to 1000 Models: Scaling Instagram’s algorithm without the Reliability Nightmare

At the beginning of 2023, Instagram had O(10) GPU models, a manual release process, and a manual monitoring setup. This talk centers on our journey to 1000 models: the bumps along the road and the foundational work built to make monitoring model health faster and more accurate. We’ll be going over model registry, […]

Building KVStore for ML Workloads at Pinterest

This presentation introduces Pinterest’s KVStore, a distributed key-value store designed to support machine learning workloads that are central to Pinterest’s functions. KVStore enables efficient low-latency ML feature serving with various data update methods. KVStore is crucial for Pinterest’s AI/ML-driven platform, evolving to meet business needs with a focus on reliability and efficiency at scale.

Splitting the Monolith

For over a decade, Facebook and most other Meta products have been powered by a single monolithic PHP application. Since 2022, we have been investing in a multitenancy framework to allow product specialization with less operational overhead. This has allowed Meta to move fast in generative AI, by providing a familiar development environment for our […]

Turbocharging AI/ML workloads: Revving Up Speed and Resilience

Race cars are built for speed and resilience, equipped with cutting-edge features to reach high velocities while maintaining a firm grip on the perilous track. What if we could apply similar features to boost the speed and resilience of AI/ML jobs running over complex networking fabrics? In this session, we’ll dive into the key networking […]

How We’re Scaling Discovery at Netflix Reliably

In the dynamic realm of streaming services, reliability and scalability are imperatives. This talk unveils the sophisticated architecture of Netflix’s Member Discovery System, known as Mosaic, which powers key member-facing pages. Discover the innovative strategies that ensure Mosaic’s robustness and reliability, and learn how Netflix sets the standard in delivering a seamless user experience to […]

AI Hardware Reliability at Scale

This talk will describe our journey with AI hardware reliability (GPU/Silicon) running large-scale training and inference at Meta. It will highlight our efforts across the ecosystem, covering vendor systems and our own custom silicon efforts to run AI hardware reliably at scale. For a SW/Services audience, this will provide an under-the-hood look into how AI […]

Experience Operating Large GPU Clusters at Organizational Scale

We outline Nvidia’s experience managing a large-scale internal GPU compute platform spanning multiple heterogeneous clusters. The platform supports thousands of users and hundreds of project accounts, handling a diverse mix of training and batch inference workloads across various research fields. We focus on three key challenges: researcher productivity, resource utilization, and operational efficiency. To improve […]

Model Freshness and Its Infra Implications

Meta’s recommendation systems rely on “freshness” – the speed at which user interaction signals are ingested, trained, and utilized. To improve model freshness, Meta developed solutions addressing scaling, serving footprints, and diverse architectures.

A Planet-Scale Computer – Abstract Away Regions via Global Service Placer (GSP)

Both public clouds and our hyperscale private cloud have evolved into complex infrastructures with millions of servers spanning numerous global data center regions. Leaving users to manage the complexity of deploying global services across regions incurs significant operational toil and often leads to suboptimal outcomes. Users must select regions, align global traffic routing with service […]

Advancing Flash Storage @ Meta

The growth of data and need for increased power efficiency are leading to innovative storage solutions. HDDs have been growing in density, but not performance, and TLC flash remains at a price point that is restrictive for scaling. QLC technology addresses these challenges by forming a middle tier between HDDs and TLC SSDs. At Meta […]

Challenges with Ultra-low Latency LLM Inference at Scale

In this talk, we will discuss the challenges of running ultra-low latency Large Language Model (LLM) inference at scale. We will cover the unique challenges of LLM inference, such as large model sizes and KV caching. We will also discuss the challenges of scaling LLM inference to handle large volumes of requests, including the need for […]
