introduction/preparation
characterizing modern ML research
Scaling and The Need For Compute
In the past decade, the success of deep learning methods has changed the nature of the field of machine learning. AlexNet already had orders of magnitude more parameters than the handcrafted techniques that preceded it on the same task, and there were many more orders of magnitude to come.
Deep learning, with its hunger for large matrix multiplications, has co-evolved with GPU technology. AlexNet was trained on two consumer-grade Nvidia GTX 580 GPUs, each with only 3 GB of VRAM and roughly 1.5 teraflops of float32 throughput. By comparison, the A100, the workhorse data center GPU that many academic labs maintain, has 40 or 80 GB of VRAM and roughly 20 teraflops of float32 throughput. The models have also grown: the largest model (denoted "extra large") in the GPT-2 family had 1.5B non-embedding parameters, while the smallest model in the Llama 3 family has 8B parameters (a "Light-weight, ultra-fast model you can run anywhere").
Most modern, practical work thus requires significant compute resources, and the requirements continue to grow. Even more compute is needed to facilitate experimentation and robust empirical tests: training a single model or testing a single architecture may require relatively modest resources, but deriving empirical scaling laws or comparing multiple models can take significantly more compute.
Compute requirements vary between research fields and styles, but the broad hunger for compute is impossible to ignore.
There are 3 main ways research labs acquire compute:
- Renting on-demand cloud GPUs from providers like AWS, GCP, or Azure
- Purchasing consumer-grade desktop GPUs attached to local workstations
- Reserving capacity from on-premises data centers (clusters)
Using desktop GPUs is common among robotics labs, because controlling robots in real time demands low latency. However, these chips are limited by their desktop form factor: they cannot offer the same performance or scale as a data center chip.
Renting cloud GPUs has no upfront or fixed costs, making it accessible to independent researchers and smaller labs with less institutional support. On-demand rental also makes it easy to match capacity to current needs and to always use up-to-date hardware. However, renting cloud GPUs for sustained, long-term use, as would be the case for a larger research lab, is expensive. Cloud GPUs are also subject to availability constraints from the provider, which can be unpredictable and difficult to plan around.
Because of these limitations, research institutions (particularly universities, established labs, and even some independent research organizations) have chosen to invest in on-premises GPU clusters. For sustained, high-volume workloads, this is the more cost-effective option despite the high upfront capital expenditure. High-profile examples that take this approach include MIT, UC Berkeley, and our very own University of Cambridge.
These large centralized clusters beget the need for robust systems that administer access to the cluster and manage the workloads that are submitted by users. This is the problem of cluster management.
compute needs ⇒ large clusters ⇒ cluster management
"scaling laws" and larger-scale experiments
why on-prem clusters are common (uni/lab/independent)
outcome: why is cluster management necessary?
outcome: who is cluster management for? (stakeholders)
focus on iteration speed + fast feedback loops ⇒ notebooks dominate
ML research is fast and hacky
so notebooks are useful
how do notebooks work? how do they support fast feedback loops?
outcome: why should a cluster manager target the notebook use-case?
case study: SLURM
SLURM treats GPUs poorly
SLURM treats notebooks poorly
outcome: motivation for design goals
design goals for bulletin (my cluster manager)
functional requirements (declarative spec)
non-functional requirements/goals (ergonomics/performance)
- target notebooks (stakeholder: user, metric: steps to launch notebook job?)
- easy set-up + good defaults (stakeholder: sysadmin, metric: unclear?)
- better utilization (stakeholder: sysadmin)
- lower latency/time-to-first-loss (stakeholder: user)
differences with SLURM
outcome: set up goals of correctness, performance, ergonomics
starting point
addendum: how is the CHAI cluster set up?
implementation
how bulletin works
architecture overview (client/server/worker)
Bulletin was built around a client/server/worker architecture.
- Clients submit jobs for remote execution on a compute node. They specify the resource and time requirements for the job.
- Workers execute jobs on their individual compute node. Each compute node has a worker that waits for jobs from the server and executes the received jobs with the specified resource and time limits.
- The server coordinates between clients and workers. It has many responsibilities, including job scheduling, deployment and signalling, monitoring worker status, and providing the interface to clients.
This architecture reflects Bulletin’s role as an interface for users to interact with the cluster. It conveniently maps the functional units of a workload manager to the real machines they will run on.
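To make the division of labour concrete, here is a minimal sketch of the job metadata that flows between these components. The field names, states, and defaults are illustrative assumptions rather than Bulletin's actual schema.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class JobState(Enum):
    PENDING = auto()     # submitted by a client, queued at the server
    SCHEDULED = auto()   # server has assigned the job to a worker
    RUNNING = auto()     # worker is executing it on a compute node
    COMPLETED = auto()
    FAILED = auto()

@dataclass
class JobSpec:
    """What a client submits: the work plus its resource and time requirements."""
    payload_path: str            # serialized closure, reachable by every node
    gpus: int = 1
    cpu_cores: int = 4
    memory_gb: int = 16
    time_limit_minutes: int = 60

@dataclass
class Job:
    """What the server tracks for each submission."""
    job_id: int
    spec: JobSpec
    state: JobState = JobState.PENDING
    worker_id: Optional[str] = None   # set once the server schedules the job
```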
Bulletin is a distributed system with shared state. There are various ways of handling shared state in a distributed system, with different fault tolerance guarantees and failure assumptions. I use a centralized PostgreSQL database as the single source of truth. This allows me to inherit its concurrency model and guarantees, but it also becomes a single point of failure. I decided that this tradeoff was worth it, for three reasons:
- PostgreSQL has been battle-tested in many systems for many years: it is unlikely our modest load will cause failures in the database itself.
- The server and Postgres are colocated on the same machine. If that machine fails, the users’ interface to the cluster will be blocked, removing the possibility of entering an inconsistent state.
- A failure in Postgres is likely to be much easier to debug than a failure in a bespoke system for managing distributed state, due to Postgres's extensive documentation and history. It is plausible that system administrators will have prior experience working with Postgres; the same is not true for a bespoke system.
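Because job state lives in Postgres, the scheduler can lean on ordinary transactions for correctness. As a rough illustration of the kind of atomic claim this enables: the snippet below is a sketch only, written in Python with psycopg2 for brevity (Bulletin's server is written in Go), and the `jobs` table and its columns are hypothetical.

```python
import psycopg2

# SKIP LOCKED lets concurrent scheduling transactions pass over rows
# that another transaction is already claiming.
CLAIM_NEXT_JOB = """
UPDATE jobs
SET    status = 'scheduled', worker_id = %s
WHERE  id = (
    SELECT id
    FROM   jobs
    WHERE  status = 'pending'
    ORDER  BY submitted_at
    LIMIT  1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""

def claim_next_job(conn, worker_id):
    # `with conn` commits the transaction on success and rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute(CLAIM_NEXT_JOB, (worker_id,))
        row = cur.fetchone()
        return row[0] if row else None

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=bulletin")   # hypothetical connection string
    print(claim_next_job(conn, "gpu-node-01"))
```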
The final architectural requirement is a way to communicate between nodes. Bulletin operates assuming a shared filesystem such as a NAS, which is very often the case. The shared filesystem is used to transport execution payloads and results, which avoids transmitting potentially large payloads through the server. For message passing and signalling, Bulletin primarily uses gRPC and Protocol Buffers (protobufs). gRPC was chosen to implement the API due to its support for long-lived streaming RPCs, its connection management tools, and its tight integration with Go. gRPC and protobufs work together seamlessly. One disadvantage is the need to keep the protos and the SQL schemas in sync with one another, but this conversion would be required even if JSON were used as the underlying message format. The server runs a gRPC service, with the workers and clients making RPCs to the server.
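A rough sketch of what a client submission could look like under this split (bulk payload over the shared filesystem, small metadata over gRPC) is shown below. The generated modules (`bulletin_pb2`, `bulletin_pb2_grpc`), the service and message names, and the NAS mount point are all hypothetical stand-ins, not Bulletin's actual generated code.

```python
import pathlib
import uuid

import cloudpickle
import grpc

# Hypothetical modules generated by protoc from Bulletin's .proto definitions.
import bulletin_pb2
import bulletin_pb2_grpc

SHARED_FS = pathlib.Path("/nfs/bulletin/payloads")   # assumed NAS mount point

def submit(closure, gpus=1, time_limit_s=3600, server="scheduler:50051"):
    # The large payload goes onto the shared filesystem, not through the server.
    payload_path = SHARED_FS / f"{uuid.uuid4()}.pkl"
    payload_path.write_bytes(cloudpickle.dumps(closure))

    # Only small, structured metadata travels over gRPC.
    with grpc.insecure_channel(server) as channel:
        stub = bulletin_pb2_grpc.SchedulerStub(channel)
        reply = stub.SubmitJob(bulletin_pb2.SubmitJobRequest(
            payload_path=str(payload_path),
            gpus=gpus,
            time_limit_seconds=time_limit_s,
        ))
    return reply.job_id
```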
python dispatch (client)
One of the primary design goals for Bulletin is to make it easy to submit jobs from a notebook. To achieve this, we bridge the gap between the locally oriented, highly stateful model of a local notebook and remote execution on GPUs.
One key mismatch between local and remote execution is state. When executing locally, variables set inside a notebook are accessible throughout the notebook, and can be updated globally. This allows the user continual flexibility in how they access the intermediate and final results of their computations.
For example, suppose I am sampling from an image-generation diffusion model. I might write a sampling function and run it to see the final results, but then want to inspect the intermediate states of the sampling process. The permissive scope of notebooks allows me to append to a list of image states and visualize that list outside the function, requiring no changes to any other calling code (the interface hasn't changed!). This permissive scope and global mutability is part of what makes notebooks so appealing for rapid experimentation and prototyping.
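A minimal sketch of that workflow; `model.denoise_step` and the plotting helper are hypothetical stand-ins for whatever the notebook actually uses.

```python
import torch

intermediate_states = []   # notebook-level (global) state

def sample(model, steps=50):
    x = torch.randn(1, 3, 64, 64)                     # start from pure noise
    for t in reversed(range(steps)):
        x = model.denoise_step(x, t)                  # hypothetical model API
        intermediate_states.append(x.detach().cpu())  # stash intermediates globally
    return x

# final_image = sample(model)              # `model` was loaded in an earlier cell
# show_image_grid(intermediate_states)     # hypothetical plotting helper, run in a later cell
```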
Thus, Bulletin's dispatch operates transparently on the functions called within a notebook. It offers a clear interface intended for use as a decorator: any function that should execute remotely can be tagged with a `@to_remote` decorator that accepts the relevant resource requests. When the function is called, it executes on a remote machine with all of its state. On completion, it updates the local state with the new remote state. Bulletin even bridges the conflict between local CPU and remote GPU execution by automatically moving PyTorch Tensors to the GPU on the remote machine and back to the CPU on completion. This lets users treat remote compute resources as if they were local.
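Usage might look like the sketch below. The `@to_remote` decorator is Bulletin's; the keyword arguments and the training function are illustrative assumptions, not the exact API.

```python
import torch
from bulletin import to_remote   # assumed import path for the client library

@to_remote(gpus=1, memory_gb=40, time_limit="02:00:00")   # illustrative resource requests
def train_epoch(model, loader, lr=3e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for batch in loader:
        loss = model(batch)      # assume the model returns a scalar loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model, loss.item()

# Called from a notebook cell like any local function, but the body runs on a
# remote GPU node; the updated model and loss come back to the local session.
# model, last_loss = train_epoch(model, loader)
```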
To achieve this transparency, Bulletin intercepts the tagged local function call, serializes the resulting closure and its environment to disk, and sends a message to the server submitting a job. Importantly, Python's default serialization module `pickle` serializes many internal structures by reference, which means that when those structures are loaded in a fresh execution environment, the references point to nothing. To fix this, I use the open-source library `cloudpickle`, which serializes all objects by value.

However, simply serializing and executing remotely only partially bridges the gap: `cloudpickle` does not expose facilities to update the local notebook environment once the remote execution finishes. Though they are usually referred to as functions, functions defined in a notebook are actually closures, in that they can refer to their surrounding environment. When the closure is called locally, that environment is simply the live state of the notebook session.
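To make the by-value behaviour concrete, here is a small, Bulletin-independent demonstration: cloudpickle captures both the function and the notebook-level variable it refers to, so the closure can be restored in a fresh interpreter.

```python
import pickle
import cloudpickle

scale = 10   # notebook-level state captured by the function below

def scaled(x):
    return scale * x

payload = cloudpickle.dumps(scaled)   # serializes the code *and* the referenced `scale` by value

# The bytes can be shipped to a machine that has never seen this notebook.
# (cloudpickle's output is readable with the standard pickle module.)
restored = pickle.loads(payload)
print(restored(4))   # -> 40, even in a fresh interpreter
```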
When the closure is executed remotely, it first deserializes its environment and moves any PyTorch Tensors (or anything that implements a `.to(device)` method) onto the remote GPU, matching the user's mental model of using remote resources as if they were local.
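A sketch of that device handling, assuming a simple recursive walk over the deserialized arguments and results; this illustrates the idea rather than Bulletin's exact implementation.

```python
import torch

def move_to(obj, device):
    """Recursively move anything exposing .to(device), e.g. torch.Tensor, onto `device`."""
    if hasattr(obj, "to"):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: move_to(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to(v, device) for v in obj)
    return obj

# On the worker, before calling the closure:
#     args = move_to(args, torch.device("cuda"))
# On completion, before serializing results back for the notebook:
#     results = move_to(results, torch.device("cpu"))
```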
(de-)serializing closures and their environments
job execution (worker)
crash-recover model
cgroups for resource limits
go context timeouts for time limits
stdout/stderr streaming? (TODO)
scheduling + allocation (server)
goroutine model for scheduler
using postgres isolation levels
worker connection management/failure detection (keepalives)
authentication (TODO)