introduction/preparation
characterizing modern ML research
Scaling and The Need For Compute
In the past decade, the success of deep learning methods has changed the nature of the field of machine learning. AlexNet already had orders of magnitude more parameters than the handcrafted techniques that preceded it on the same task, and there were many more orders of magnitude to come.
Deep learning, with its hunger for large matrix multiplications, has co-evolved with GPU technology. AlexNet was trained on two consumer-grade Nvidia GTX 580 GPUs, each with only 3 GB of VRAM and roughly 1.5 teraflops of float32 throughput. By comparison, the A100, the workhorse data center GPU that many academic labs maintain, has 40 or 80 GB of VRAM and roughly 20 teraflops of float32 throughput. The models have also grown: the largest model (denoted "extra large") in the GPT-2 family had 1.5B non-embedding parameters, while the smallest model in the Llama 3 family has 8B parameters (a "Light-weight, ultra-fast model you can run anywhere").
Most modern, practical work thus requires significant compute resources, and the requirements continue to grow. Even more compute is needed to facilitate experimentation and robust empirical tests: training a single model or testing a single architecture may require relatively modest resources, but deriving empirical scaling laws or comparing multiple models can take significantly more compute.
Compute requirements vary between research fields and styles, but the broad hunger for compute is impossible to ignore.
There are 3 main ways research labs acquire compute:
- Renting on-demand cloud GPUs from providers like AWS, GCP, or Azure
- Purchasing consumer-grade desktop GPUs attached to local workstations
- Reserving capacity from on-premises data centers (clusters)
Using desktop GPUs is common among robotics labs, because controlling robots in real time demands low latency. However, these chips are limited by their desktop form factor: they cannot offer the same performance or scale as a data center chip.
Renting cloud GPUs has no upfront or fixed costs, making it accessible to independent researchers and smaller labs with less institutional support. On-demand rental also makes it easy to match capacity to current needs and to always use up-to-date hardware. However, renting cloud GPUs for sustained, long-term use, as would be the case for a larger research lab, is expensive. Cloud GPUs are also subject to availability constraints from the provider, which can be unpredictable and difficult to plan around.
Because of these limitations, research institutions (particularly universities, established labs, and even some independent research organizations) have chosen to invest in on-premises GPU clusters. For sustained, high-volume workloads, this is the more cost-effective option despite the high upfront capital expenditure. High-profile examples that take this approach include MIT, UC Berkeley, and our very own University of Cambridge.
These large centralized clusters beget the need for robust systems that administer access to the cluster and manage the workloads that are submitted by users. This is the problem of cluster management.
compute needs ⇒ large clusters ⇒ cluster management
"scaling laws" and larger-scale experiments
why on-prem clusters are common (uni/lab/independent)
outcome: why is cluster management necessary?
outcome: who is cluster management for? (stakeholders)
focus on iteration speed + fast feedback loops ⇒ notebooks dominate
ML research is fast and hacky
so notebooks are useful
how do notebooks work? how do they support fast feedback loops?
outcome: why should a cluster manager target the notebook use-case?
case study: SLURM
SLURM treats GPUs poorly
SLURM treats notebooks poorly
outcome: motivation for design goals
design goals for bulletin (my cluster manager)
functional requirements (declarative spec)
non-functional requirements/goals (ergonomics/performance)
- target notebooks (stakeholder: user, metric: steps to launch notebook job?)
- easy set-up + good defaults (stakeholder: sysadmin, metric: unclear?)
- better utilization (stakeholder: sysadmin)
- lower latency/time-to-first-loss (stakeholder: user)
differences with SLURM
outcome: set up goals of correctness, performance, ergonomics
starting point
addendum: how is the CHAI cluster set up?
implementation
how bulletin works
architecture overview (client/server/worker)
Bulletin was built around a client/server/worker architecture.
- Clients submit jobs for remote execution on a compute node. They specify the resource and time requirements for the job.
- Workers execute jobs on their individual compute node. Each compute node has a worker that waits for jobs from the server and executes the received jobs with the specified resource and time limits.
- The server coordinates between clients and workers. It has many responsibilities, including job scheduling, deployment and signalling, monitoring worker status, and providing the interface to clients.
This architecture reflects Bulletin’s role as an interface for users to interact with the cluster. It conveniently maps the functional units of a workload manager to the real machines they will run on.
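To make the division of labour concrete, here is a minimal sketch of the job metadata that flows between these components. The field names, states, and defaults are illustrative assumptions rather than Bulletin's actual schema.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class JobState(Enum):
    PENDING = auto()     # submitted by a client, queued at the server
    SCHEDULED = auto()   # server has assigned the job to a worker
    RUNNING = auto()     # worker is executing it on a compute node
    COMPLETED = auto()
    FAILED = auto()

@dataclass
class JobSpec:
    """What a client submits: the work plus its resource and time requirements."""
    payload_path: str            # serialized closure, reachable by every node
    gpus: int = 1
    cpu_cores: int = 4
    memory_gb: int = 16
    time_limit_minutes: int = 60

@dataclass
class Job:
    """What the server tracks for each submission."""
    job_id: int
    spec: JobSpec
    state: JobState = JobState.PENDING
    worker_id: Optional[str] = None   # set once the server schedules the job
```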
Bulletin is a distributed system with shared state. There are various ways of handling shared state in a distributed system, with different fault tolerance guarantees and failure assumptions. I use a centralized PostgreSQL database as the single source of truth. This allows me to inherit its concurrency model and guarantees, but it also becomes a single point of failure. I decided that this tradeoff was worth it, for three reasons:
- PostgreSQL has been battle-tested in many systems for many years: it is unlikely our modest load will cause failures in the database itself.
- The server and Postgres are colocated on the same machine. If that machine fails, the users’ interface to the cluster will be blocked, removing the possibility of entering an inconsistent state.
- A failure in Postgres is likely to be much easier to debug than a failure in a bespoke system for managing distributed state, due to Postgres's extensive documentation and history. It is plausible that system administrators will have prior experience working with Postgres; the same is not true for a bespoke system.
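Because job state lives in Postgres, the scheduler can lean on ordinary transactions for correctness. As a rough illustration of the kind of atomic claim this enables: the snippet below is a sketch only, written in Python with psycopg2 for brevity (Bulletin's server is written in Go), and the `jobs` table and its columns are hypothetical.

```python
import psycopg2

# SKIP LOCKED lets concurrent scheduling transactions pass over rows
# that another transaction is already claiming.
CLAIM_NEXT_JOB = """
UPDATE jobs
SET    status = 'scheduled', worker_id = %s
WHERE  id = (
    SELECT id
    FROM   jobs
    WHERE  status = 'pending'
    ORDER  BY submitted_at
    LIMIT  1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""

def claim_next_job(conn, worker_id):
    # `with conn` commits the transaction on success and rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute(CLAIM_NEXT_JOB, (worker_id,))
        row = cur.fetchone()
        return row[0] if row else None

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=bulletin")   # hypothetical connection string
    print(claim_next_job(conn, "gpu-node-01"))
```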
The final architectural requirement is a way to communicate between nodes. Bulletin operates assuming a shared filesystem such as a NAS, which is very often the case. The shared filesystem is used to transport execution payloads and results, which avoids transmitting potentially large payloads through the server. For message passing and signalling, Bulletin primarily uses gRPC and Protocol Buffers (protobufs). gRPC was chosen to implement the API due to its support for long-lived streaming RPCs, its connection management tools, and its tight integration with Go. gRPC and protobufs work together seamlessly. One disadvantage is the need to keep the protos and the SQL schemas in sync with one another, but this conversion would be required even if JSON were used as the underlying message format. The server runs a gRPC service, with the workers and clients making RPCs to the server.
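A rough sketch of what a client submission could look like under this split (bulk payload over the shared filesystem, small metadata over gRPC) is shown below. The generated modules (`bulletin_pb2`, `bulletin_pb2_grpc`), the service and message names, and the NAS mount point are all hypothetical stand-ins, not Bulletin's actual generated code.

```python
import pathlib
import uuid

import cloudpickle
import grpc

# Hypothetical modules generated by protoc from Bulletin's .proto definitions.
import bulletin_pb2
import bulletin_pb2_grpc

SHARED_FS = pathlib.Path("/nfs/bulletin/payloads")   # assumed NAS mount point

def submit(closure, gpus=1, time_limit_s=3600, server="scheduler:50051"):
    # The large payload goes onto the shared filesystem, not through the server.
    payload_path = SHARED_FS / f"{uuid.uuid4()}.pkl"
    payload_path.write_bytes(cloudpickle.dumps(closure))

    # Only small, structured metadata travels over gRPC.
    with grpc.insecure_channel(server) as channel:
        stub = bulletin_pb2_grpc.SchedulerStub(channel)
        reply = stub.SubmitJob(bulletin_pb2.SubmitJobRequest(
            payload_path=str(payload_path),
            gpus=gpus,
            time_limit_seconds=time_limit_s,
        ))
    return reply.job_id
```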
python dispatch (client)
One of the primary design goals for Bulletin is to make it easy to submit jobs from a notebook. To achieve this, we bridge the gap between the locally oriented, highly stateful model of a local notebook and remote execution on GPUs.
One key mismatch between local and remote execution is state. When executing locally, variables set inside a notebook are accessible throughout the notebook, and can be updated globally. This allows the user continual flexibility in how they access the intermediate and final results of their computations.
For example, suppose I am sampling from an image-generation diffusion model. I might write a sampling function and run it to see the final results, but then want to inspect the intermediate states of the sampling process. The permissive scope of notebooks allows me to append to a list of image states and visualize that list outside the function, requiring no changes to any other calling code (the interface hasn't changed!). This permissive scope and global mutability is part of what makes notebooks so appealing for rapid experimentation and prototyping.
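A minimal sketch of that workflow; `model.denoise_step` and the plotting helper are hypothetical stand-ins for whatever the notebook actually uses.

```python
import torch

intermediate_states = []   # notebook-level (global) state

def sample(model, steps=50):
    x = torch.randn(1, 3, 64, 64)                     # start from pure noise
    for t in reversed(range(steps)):
        x = model.denoise_step(x, t)                  # hypothetical model API
        intermediate_states.append(x.detach().cpu())  # stash intermediates globally
    return x

# final_image = sample(model)              # `model` was loaded in an earlier cell
# show_image_grid(intermediate_states)     # hypothetical plotting helper, run in a later cell
```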
Thus, Bulletin's dispatch operates transparently on the functions called within a notebook. It offers a clear interface intended for use as a decorator: any function that should execute remotely can be tagged with a `@to_remote` decorator that accepts the relevant resource requests. When the function is called, it executes on a remote machine with all of its state. On completion, it updates the local state with the new remote state. Bulletin even bridges the conflict between local CPU and remote GPU execution by automatically moving PyTorch Tensors to the GPU on the remote machine and back to the CPU on completion. This lets users treat remote compute resources as if they were local.
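Usage might look like the sketch below. The `@to_remote` decorator is Bulletin's; the keyword arguments and the training function are illustrative assumptions, not the exact API.

```python
import torch
from bulletin import to_remote   # assumed import path for the client library

@to_remote(gpus=1, memory_gb=40, time_limit="02:00:00")   # illustrative resource requests
def train_epoch(model, loader, lr=3e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for batch in loader:
        loss = model(batch)      # assume the model returns a scalar loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model, loss.item()

# Called from a notebook cell like any local function, but the body runs on a
# remote GPU node; the updated model and loss come back to the local session.
# model, last_loss = train_epoch(model, loader)
```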
To achieve this transparency, Bulletin intercepts the tagged local function call, serializes the resulting closure and its environment to disk, and sends a message to the server submitting a job. Importantly, Python's default serialization module `pickle` serializes many internal structures by reference, which means that when those structures are loaded in a fresh execution environment, the references point to nothing. To fix this, I use the open-source library `cloudpickle`, which serializes all objects by value.

However, simply serializing and executing remotely only partially bridges the gap: `cloudpickle` does not expose facilities to update the local notebook environment once the remote execution finishes. Though they are usually referred to as functions, functions defined in a notebook are actually closures, in that they can refer to their surrounding environment. When the closure is called locally, that environment is simply the live state of the notebook session.
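To make the by-value behaviour concrete, here is a small, Bulletin-independent demonstration: cloudpickle captures both the function and the notebook-level variable it refers to, so the closure can be restored in a fresh interpreter.

```python
import pickle
import cloudpickle

scale = 10   # notebook-level state captured by the function below

def scaled(x):
    return scale * x

payload = cloudpickle.dumps(scaled)   # serializes the code *and* the referenced `scale` by value

# The bytes can be shipped to a machine that has never seen this notebook.
# (cloudpickle's output is readable with the standard pickle module.)
restored = pickle.loads(payload)
print(restored(4))   # -> 40, even in a fresh interpreter
```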
When the closure is executed remotely, it first deserializes its environment and moves any PyTorch Tensors (or anything that implements a `.to(device)` method) onto the remote GPU, matching the user's mental model of using remote resources as if they were local.
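A sketch of that device handling, assuming a simple recursive walk over the deserialized arguments and results; this illustrates the idea rather than Bulletin's exact implementation.

```python
import torch

def move_to(obj, device):
    """Recursively move anything exposing .to(device), e.g. torch.Tensor, onto `device`."""
    if hasattr(obj, "to"):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: move_to(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to(v, device) for v in obj)
    return obj

# On the worker, before calling the closure:
#     args = move_to(args, torch.device("cuda"))
# On completion, before serializing results back for the notebook:
#     results = move_to(results, torch.device("cpu"))
```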
(de-)serializing closures and their environments
job execution (worker)
crash-recover model
cgroups for resource limits
go context timeouts for time limits
stdout/stderr streaming? (TODO)
scheduling + allocation (server)
goroutine model for scheduler
using postgres isolation levels
worker connection management/failure detection (keepalives)
authentication (TODO)