From 6d301d32263272dafa31dd8761de12d44d4d0540 Mon Sep 17 00:00:00 2001 From: Chris Llanwarne Date: Tue, 10 Feb 2026 15:21:16 -0500 Subject: [PATCH 1/9] WIP --- batch/worker-vm-image.md | 136 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 136 insertions(+) create mode 100644 batch/worker-vm-image.md diff --git a/batch/worker-vm-image.md b/batch/worker-vm-image.md new file mode 100644 index 00000000000..a39b5dadfeb --- /dev/null +++ b/batch/worker-vm-image.md @@ -0,0 +1,136 @@ +# Batch Worker VM Image + +## Background + +Batch worker VMs use a two-layer image system: + +1. **VM image** (this doc) -- A cloud-provider VM image (GCE image / Azure Shared Image Gallery) + with the base OS, Docker, GPU drivers, and the root Docker image pre-pulled. This is what each + worker VM boots from. + +2. **Docker worker image** (`Dockerfile.worker`) -- The container that actually runs the batch + worker process. It's built separately via `make batch-worker-image` and pulled onto workers at + startup. + +The VM image exists primarily so that worker VMs boot fast. Without it, every new worker would +need to install Docker, pull a ~2GB root image, install GPU drivers, etc. on every boot. By baking +all of that into a VM image, new workers go from creation to running jobs in under a minute. + +The VM image changes rarely (only when we need to update Docker, GPU drivers, or other +OS-level dependencies). The Docker worker image changes on every deploy. + +### How the build works + +The build scripts create a temporary VM from a stock Ubuntu image, run a startup script that +installs everything, wait for the VM to shut itself down, then snapshot its disk into a reusable +image. The temporary VM is deleted afterward. 
+ +## GCP + +### Prerequisites + +- `gcloud` configured and authenticated with the target project +- `NAMESPACE` environment variable set (usually `default`) +- Access to the project's global-config (for project ID, zone, docker root image) + +### Building the image + +From the repo root: + +```bash +NAMESPACE=default batch/gcp-create-worker-image.sh +``` + +The script will show a confirmation prompt with the image name, version, project, and zone before +proceeding. + +### What gets installed (startup script) + +The GCP startup script (`build-batch-worker-image-startup-gcp.sh`) installs: + +- Google logging agent and Cloud Ops agent +- Docker CE + `docker-credential-gcr` +- NVIDIA drivers (535.183.01) and `nvidia-container-toolkit` +- Build tools (gcc-12, g++-12) +- Pre-pulls the `docker_root_image` from Artifact Registry + +The VM shuts itself down when the script completes. The build script polls for this, then snapshots +the disk. + +### Image naming + +- `default` namespace: `batch-worker-{VERSION}` (e.g. `batch-worker-17`) +- Other namespaces: `batch-worker-{NAMESPACE}-{VERSION}` + +### Bumping the version + +1. Increment `WORKER_IMAGE_VERSION` in `batch/gcp-create-worker-image.sh` +2. Run the build script as above +3. Update the hardcoded image reference in + `batch/batch/cloud/gcp/driver/create_instance.py` (search for `batch-worker-`) +4. Deploy batch + +### First-time setup + +If this is a brand new deployment, the worker image must be created before Batch can be deployed. 
+See `infra/gcp/README.md` -- the image creation step comes after `deploy_unmanaged` and before +downloading global-config: + +```bash +NAMESPACE=default $HAIL/batch/gcp-create-worker-image.sh +``` + +## Azure + +### Prerequisites + +- Azure CLI authenticated +- `$HAIL` environment variable pointing to the repo root +- Access to global-config secret (for subscription ID, resource group, location, docker prefix) +- The `batch-worker` managed identity must exist in the resource group + +### Building the image + +```bash +source $HAIL/devbin/functions.sh +$HAIL/batch/az-create-worker-image.sh +``` + +### What gets installed (startup script) + +The Azure startup script (`build-batch-worker-image-startup-azure.sh`) installs: + +- Docker CE +- Azure CLI +- Authenticates with Azure Container Registry via managed identity +- Pre-pulls the `docker_root_image` + +Note: Azure workers do not have GPU support baked into the VM image (unlike GCP). + +### Image naming + +Images are stored in an Azure Shared Image Gallery: + +- Gallery: `{RESOURCE_GROUP}_batch` +- Image definition: `batch-worker-22-04` +- Version: `0.0.{N}` (e.g. `0.0.14`) + +### Bumping the version + +1. Increment `WORKER_VERSION` in `batch/az-create-worker-image.sh` +2. Run the build script as above +3. Update the hardcoded image reference in + `batch/batch/cloud/azure/driver/create_instance.py` (search for `batch-worker-22-04/versions/`) +4. 
Deploy batch + +## Key files + +| File | Purpose | +|------|---------| +| `batch/gcp-create-worker-image.sh` | GCP build orchestration script | +| `batch/az-create-worker-image.sh` | Azure build orchestration script | +| `batch/build-batch-worker-image-startup-gcp.sh` | GCP VM startup/provisioning (Jinja2 template) | +| `batch/build-batch-worker-image-startup-azure.sh` | Azure VM startup/provisioning (Jinja2 template) | +| `batch/Dockerfile.worker` | Docker worker image (separate from the VM image) | +| `batch/batch/cloud/gcp/driver/create_instance.py` | Runtime: creates worker VMs using the GCP image | +| `batch/batch/cloud/azure/driver/create_instance.py` | Runtime: creates worker VMs using the Azure image | From e71aa1e8193071105ceb6e975b678c2886dd61e9 Mon Sep 17 00:00:00 2001 From: Chris Llanwarne Date: Thu, 30 Apr 2026 11:59:23 -0400 Subject: [PATCH 2/9] feedback --- .../services/batch}/worker-vm-image.md | 54 ++----------------- 1 file changed, 4 insertions(+), 50 deletions(-) rename {batch => dev-docs/services/batch}/worker-vm-image.md (63%) diff --git a/batch/worker-vm-image.md b/dev-docs/services/batch/worker-vm-image.md similarity index 63% rename from batch/worker-vm-image.md rename to dev-docs/services/batch/worker-vm-image.md index a39b5dadfeb..9741d0059e6 100644 --- a/batch/worker-vm-image.md +++ b/dev-docs/services/batch/worker-vm-image.md @@ -4,9 +4,8 @@ Batch worker VMs use a two-layer image system: -1. **VM image** (this doc) -- A cloud-provider VM image (GCE image / Azure Shared Image Gallery) - with the base OS, Docker, GPU drivers, and the root Docker image pre-pulled. This is what each - worker VM boots from. +1. **VM image** (this doc) -- A GCE VM image with the base OS, Docker, GPU drivers, and the root + Docker image pre-pulled. This is what each worker VM boots from. 2. **Docker worker image** (`Dockerfile.worker`) -- The container that actually runs the batch worker process. 
It's built separately via `make batch-worker-image` and pulled onto workers at @@ -50,7 +49,8 @@ The GCP startup script (`build-batch-worker-image-startup-gcp.sh`) installs: - Google logging agent and Cloud Ops agent - Docker CE + `docker-credential-gcr` -- NVIDIA drivers (535.183.01) and `nvidia-container-toolkit` +- NVIDIA drivers (535.183.01) and `nvidia-container-toolkit` — baked into every worker image, but + only activated at runtime on VMs with GPUs attached (see `is_gpu()` check in `worker.py`) - Build tools (gcc-12, g++-12) - Pre-pulls the `docker_root_image` from Artifact Registry @@ -80,57 +80,11 @@ downloading global-config: NAMESPACE=default $HAIL/batch/gcp-create-worker-image.sh ``` -## Azure - -### Prerequisites - -- Azure CLI authenticated -- `$HAIL` environment variable pointing to the repo root -- Access to global-config secret (for subscription ID, resource group, location, docker prefix) -- The `batch-worker` managed identity must exist in the resource group - -### Building the image - -```bash -source $HAIL/devbin/functions.sh -$HAIL/batch/az-create-worker-image.sh -``` - -### What gets installed (startup script) - -The Azure startup script (`build-batch-worker-image-startup-azure.sh`) installs: - -- Docker CE -- Azure CLI -- Authenticates with Azure Container Registry via managed identity -- Pre-pulls the `docker_root_image` - -Note: Azure workers do not have GPU support baked into the VM image (unlike GCP). - -### Image naming - -Images are stored in an Azure Shared Image Gallery: - -- Gallery: `{RESOURCE_GROUP}_batch` -- Image definition: `batch-worker-22-04` -- Version: `0.0.{N}` (e.g. `0.0.14`) - -### Bumping the version - -1. Increment `WORKER_VERSION` in `batch/az-create-worker-image.sh` -2. Run the build script as above -3. Update the hardcoded image reference in - `batch/batch/cloud/azure/driver/create_instance.py` (search for `batch-worker-22-04/versions/`) -4. 
Deploy batch - ## Key files | File | Purpose | |------|---------| | `batch/gcp-create-worker-image.sh` | GCP build orchestration script | -| `batch/az-create-worker-image.sh` | Azure build orchestration script | | `batch/build-batch-worker-image-startup-gcp.sh` | GCP VM startup/provisioning (Jinja2 template) | -| `batch/build-batch-worker-image-startup-azure.sh` | Azure VM startup/provisioning (Jinja2 template) | | `batch/Dockerfile.worker` | Docker worker image (separate from the VM image) | | `batch/batch/cloud/gcp/driver/create_instance.py` | Runtime: creates worker VMs using the GCP image | -| `batch/batch/cloud/azure/driver/create_instance.py` | Runtime: creates worker VMs using the Azure image | From 9364da4d5694fef08f1fcb19ad53c49d6574a30f Mon Sep 17 00:00:00 2001 From: Chris Llanwarne Date: Thu, 30 Apr 2026 12:41:36 -0400 Subject: [PATCH 3/9] separate first time from subsequent --- dev-docs/services/batch/worker-vm-image.md | 31 +++++++++++----------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/dev-docs/services/batch/worker-vm-image.md b/dev-docs/services/batch/worker-vm-image.md index 9741d0059e6..d2762e95fdd 100644 --- a/dev-docs/services/batch/worker-vm-image.md +++ b/dev-docs/services/batch/worker-vm-image.md @@ -26,13 +26,19 @@ image. The temporary VM is deleted afterward. ## GCP -### Prerequisites +### First time creation + +If this is a brand new deployment, the worker image must be created before Batch can be deployed. +See `infra/gcp/README.md` — the image creation step comes after `deploy_unmanaged` and before +downloading global-config. 
+ +#### Prerequisites - `gcloud` configured and authenticated with the target project - `NAMESPACE` environment variable set (usually `default`) - Access to the project's global-config (for project ID, zone, docker root image) -### Building the image +#### Building the image From the repo root: @@ -43,7 +49,7 @@ NAMESPACE=default batch/gcp-create-worker-image.sh The script will show a confirmation prompt with the image name, version, project, and zone before proceeding. -### What gets installed (startup script) +#### What gets installed (startup script) The GCP startup script (`build-batch-worker-image-startup-gcp.sh`) installs: @@ -57,29 +63,22 @@ The GCP startup script (`build-batch-worker-image-startup-gcp.sh`) installs: The VM shuts itself down when the script completes. The build script polls for this, then snapshots the disk. -### Image naming +#### Image naming - `default` namespace: `batch-worker-{VERSION}` (e.g. `batch-worker-17`) - Other namespaces: `batch-worker-{NAMESPACE}-{VERSION}` -### Bumping the version +### Incrementing the version + +Increment the version before running the script — the version is part of the image name, so running +without incrementing would delete and replace the existing production image in place. 1. Increment `WORKER_IMAGE_VERSION` in `batch/gcp-create-worker-image.sh` -2. Run the build script as above +2. Run the build script (see [Building the image](#building-the-image) above) 3. Update the hardcoded image reference in `batch/batch/cloud/gcp/driver/create_instance.py` (search for `batch-worker-`) 4. Deploy batch -### First-time setup - -If this is a brand new deployment, the worker image must be created before Batch can be deployed. 
-See `infra/gcp/README.md` -- the image creation step comes after `deploy_unmanaged` and before -downloading global-config: - -```bash -NAMESPACE=default $HAIL/batch/gcp-create-worker-image.sh -``` - ## Key files | File | Purpose | From deca7313d865b617aa5e9d8459c031d5e5bc0054 Mon Sep 17 00:00:00 2001 From: Chris Llanwarne Date: Thu, 30 Apr 2026 16:56:41 -0400 Subject: [PATCH 4/9] Clarify version incrementing steps for image creation Emphasize the importance of incrementing the version before running the script to avoid replacing the production image. --- dev-docs/services/batch/worker-vm-image.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/dev-docs/services/batch/worker-vm-image.md b/dev-docs/services/batch/worker-vm-image.md index d2762e95fdd..e27b79791d9 100644 --- a/dev-docs/services/batch/worker-vm-image.md +++ b/dev-docs/services/batch/worker-vm-image.md @@ -70,14 +70,16 @@ the disk. ### Incrementing the version -Increment the version before running the script — the version is part of the image name, so running -without incrementing would delete and replace the existing production image in place. - 1. Increment `WORKER_IMAGE_VERSION` in `batch/gcp-create-worker-image.sh` + - Very Important! You MUST increment the version before running the script! The version is part of +the image name, so running without doing this would replace the current image relied on in prod. 2. Run the build script (see [Building the image](#building-the-image) above) + - Start with a custom NAMESPACE to make sure the image builds and deploys successfully. If it +looks good, move on to the `default` namespace (nb: remembering to double-check the image version). 3. Update the hardcoded image reference in `batch/batch/cloud/gcp/driver/create_instance.py` (search for `batch-worker-`) -4. Deploy batch +4. Test and run CI. +5. 
Deploy batch ## Key files From 85567373da07450bda79ba7e995880f2fc40c775 Mon Sep 17 00:00:00 2001 From: Chris Llanwarne Date: Thu, 30 Apr 2026 17:36:26 -0400 Subject: [PATCH 5/9] Update worker VM image documentation steps Clarify steps for running the build script and updating image references. --- dev-docs/services/batch/worker-vm-image.md | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/dev-docs/services/batch/worker-vm-image.md b/dev-docs/services/batch/worker-vm-image.md index e27b79791d9..0ac79831cdb 100644 --- a/dev-docs/services/batch/worker-vm-image.md +++ b/dev-docs/services/batch/worker-vm-image.md @@ -73,13 +73,20 @@ the disk. 1. Increment `WORKER_IMAGE_VERSION` in `batch/gcp-create-worker-image.sh` - Very Important! You MUST increment the version before running the script! The version is part of the image name, so running without doing this would replace the current image relied on in prod. -2. Run the build script (see [Building the image](#building-the-image) above) - - Start with a custom NAMESPACE to make sure the image builds and deploys successfully. If it -looks good, move on to the `default` namespace (nb: remembering to double-check the image version). -3. Update the hardcoded image reference in +2. Run the build script with a custom NAMESPACE, to make sure the image builds and deploys successfully. + - eg: `NAMESPACE=YOURNAME batch/gcp-create-worker-image.sh` +4. Update the hardcoded image reference in `batch/batch/cloud/gcp/driver/create_instance.py` (search for `batch-worker-`) -4. Test and run CI. -5. Deploy batch +5. Test deploy, and check things look good. +6. Repeat with `NAMESPACE=default` +7. Create PR, watch CI, deploy with the updated image. +8. NOTE: The rollout will only impact newly created workers. Any preexisting workers will remain on their +old image unti they get manually deleted, or gradually replaced through normal system operation. 
+ +#### Rollback + +Rollback is fortunately easy: simply revert the version in `batch/batch/cloud/gcp/driver/create_instance.py` and +redeploy. You'll need to delete alkl ## Key files From 4e4651eaa55b478da484389c7a5981c32636643d Mon Sep 17 00:00:00 2001 From: Chris Llanwarne Date: Thu, 30 Apr 2026 17:41:09 -0400 Subject: [PATCH 6/9] Revise worker VM image build instructions Updated instructions for building worker VM image, including new steps for identifying a base image and monitoring the build process. --- dev-docs/services/batch/worker-vm-image.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/dev-docs/services/batch/worker-vm-image.md b/dev-docs/services/batch/worker-vm-image.md index 0ac79831cdb..252e7422088 100644 --- a/dev-docs/services/batch/worker-vm-image.md +++ b/dev-docs/services/batch/worker-vm-image.md @@ -73,14 +73,17 @@ the disk. 1. Increment `WORKER_IMAGE_VERSION` in `batch/gcp-create-worker-image.sh` - Very Important! You MUST increment the version before running the script! The version is part of the image name, so running without doing this would replace the current image relied on in prod. -2. Run the build script with a custom NAMESPACE, to make sure the image builds and deploys successfully. +2. Take this opportunity to identify a suitable new base image (`UBUNTU_IMAGE` - look in the image list in GCP console for the latest). +3. Run the build script with a custom NAMESPACE, to make sure the image builds and deploys successfully. - eg: `NAMESPACE=YOURNAME batch/gcp-create-worker-image.sh` -4. Update the hardcoded image reference in +4. Monitor the build process: open the VM list in gcloud console. find the build worker you just triggered. Under the three dots +open Monitoring and watch the logs. +5. Update the hardcoded image reference in `batch/batch/cloud/gcp/driver/create_instance.py` (search for `batch-worker-`) -5. Test deploy, and check things look good. -6. Repeat with `NAMESPACE=default` -7. 
Create PR, watch CI, deploy with the updated image. -8. NOTE: The rollout will only impact newly created workers. Any preexisting workers will remain on their +6. Test deploy, and check things look good. +7. Repeat with `NAMESPACE=default` +8. Create PR, watch CI, deploy with the updated image. +9. NOTE: The rollout will only impact newly created workers. Any preexisting workers will remain on their old image unti they get manually deleted, or gradually replaced through normal system operation. #### Rollback From 26950b5c31b9441ca082027697ed1b2d27cc1c49 Mon Sep 17 00:00:00 2001 From: Chris Llanwarne Date: Fri, 1 May 2026 17:10:45 -0400 Subject: [PATCH 7/9] Apply suggestions from code review Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com> --- dev-docs/services/batch/worker-vm-image.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dev-docs/services/batch/worker-vm-image.md b/dev-docs/services/batch/worker-vm-image.md index 252e7422088..b2e96458609 100644 --- a/dev-docs/services/batch/worker-vm-image.md +++ b/dev-docs/services/batch/worker-vm-image.md @@ -84,12 +84,12 @@ open Monitoring and watch the logs. 7. Repeat with `NAMESPACE=default` 8. Create PR, watch CI, deploy with the updated image. 9. NOTE: The rollout will only impact newly created workers. Any preexisting workers will remain on their -old image unti they get manually deleted, or gradually replaced through normal system operation. +old image until they get manually deleted, or gradually replaced through normal system operation. #### Rollback Rollback is fortunately easy: simply revert the version in `batch/batch/cloud/gcp/driver/create_instance.py` and -redeploy. You'll need to delete alkl +redeploy. You'll need to delete all existing workers for the rollback to take effect. 
## Key files From bd1b65bfc46c2b0e4014c6a7b7b18ab4c413b16c Mon Sep 17 00:00:00 2001 From: Chris Llanwarne Date: Fri, 1 May 2026 17:13:10 -0400 Subject: [PATCH 8/9] Enhance version incrementing procedure in documentation Added steps for identifying new base images and checking GPU drivers during version incrementing. --- dev-docs/services/batch/worker-vm-image.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/dev-docs/services/batch/worker-vm-image.md b/dev-docs/services/batch/worker-vm-image.md index b2e96458609..8c5d09ffa55 100644 --- a/dev-docs/services/batch/worker-vm-image.md +++ b/dev-docs/services/batch/worker-vm-image.md @@ -70,10 +70,17 @@ the disk. ### Incrementing the version +#### At the same time + +1. Take this opportunity to identify a suitable new base image (`UBUNTU_IMAGE` - look in the image list in GCP console for the latest). +2. Also check whether there are new nvidia GPU drivers (I've seen out of date drivers cause the build to hang indefinitely if they don't + match the OS version) + +#### Procedure to Increment + 1. Increment `WORKER_IMAGE_VERSION` in `batch/gcp-create-worker-image.sh` - Very Important! You MUST increment the version before running the script! The version is part of the image name, so running without doing this would replace the current image relied on in prod. -2. Take this opportunity to identify a suitable new base image (`UBUNTU_IMAGE` - look in the image list in GCP console for the latest). 3. Run the build script with a custom NAMESPACE, to make sure the image builds and deploys successfully. - eg: `NAMESPACE=YOURNAME batch/gcp-create-worker-image.sh` 4. Monitor the build process: open the VM list in gcloud console. find the build worker you just triggered. 
Under the three dots From a2919c4214ed5cc379ca8719f4736479e8877f6b Mon Sep 17 00:00:00 2001 From: Chris Llanwarne Date: Fri, 1 May 2026 17:19:28 -0400 Subject: [PATCH 9/9] Fix numbering in worker VM image documentation --- dev-docs/services/batch/worker-vm-image.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/dev-docs/services/batch/worker-vm-image.md b/dev-docs/services/batch/worker-vm-image.md index 8c5d09ffa55..12f9a5ba540 100644 --- a/dev-docs/services/batch/worker-vm-image.md +++ b/dev-docs/services/batch/worker-vm-image.md @@ -81,16 +81,16 @@ the disk. 1. Increment `WORKER_IMAGE_VERSION` in `batch/gcp-create-worker-image.sh` - Very Important! You MUST increment the version before running the script! The version is part of the image name, so running without doing this would replace the current image relied on in prod. -3. Run the build script with a custom NAMESPACE, to make sure the image builds and deploys successfully. +2. Run the build script with a custom NAMESPACE, to make sure the image builds and deploys successfully. - eg: `NAMESPACE=YOURNAME batch/gcp-create-worker-image.sh` -4. Monitor the build process: open the VM list in gcloud console. find the build worker you just triggered. Under the three dots +3. Monitor the build process: open the VM list in gcloud console. find the build worker you just triggered. Under the three dots open Monitoring and watch the logs. -5. Update the hardcoded image reference in +4. Update the hardcoded image reference in `batch/batch/cloud/gcp/driver/create_instance.py` (search for `batch-worker-`) -6. Test deploy, and check things look good. -7. Repeat with `NAMESPACE=default` -8. Create PR, watch CI, deploy with the updated image. -9. NOTE: The rollout will only impact newly created workers. Any preexisting workers will remain on their +5. Test deploy, and check things look good. +6. Repeat with `NAMESPACE=default` +7. Create PR, watch CI, deploy with the updated image. +8. 
NOTE: The rollout will only impact newly created workers. Any preexisting workers will remain on their old image until they get manually deleted, or gradually replaced through normal system operation. #### Rollback
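Review note on the "Image naming" and "Procedure to Increment" sections of this series: the naming rule and the increment-first warning can be sketched as a small shell helper. This is a minimal sketch, not code from the repo. The function names `worker_image_name` and `image_exists`, the namespace `yourname`, and version `18` are hypothetical, and the project ID is assumed to be passed in by the caller. Whether `batch/gcp-create-worker-image.sh` performs a similar existence check itself is not shown in this series.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the naming convention from worker-vm-image.md.
# Nothing here is taken from batch/gcp-create-worker-image.sh.

# worker_image_name NAMESPACE VERSION
# Prints the image name per the documented convention:
#   default namespace -> batch-worker-{VERSION}
#   other namespaces  -> batch-worker-{NAMESPACE}-{VERSION}
worker_image_name() {
    local namespace="$1" version="$2"
    if [ "$namespace" = "default" ]; then
        printf 'batch-worker-%s\n' "$version"
    else
        printf 'batch-worker-%s-%s\n' "$namespace" "$version"
    fi
}

# image_exists IMAGE_NAME PROJECT
# Returns 0 if the image already exists in the project. Requires an
# authenticated gcloud; it is defined here but not called in this sketch.
image_exists() {
    gcloud compute images describe "$1" --project "$2" >/dev/null 2>&1
}

# Preview the names a build would produce, first for a personal namespace
# (the dry run), then for the default namespace. Version 18 is made up.
worker_image_name "yourname" 18
worker_image_name "default" 18
```

A guard such as `image_exists "$(worker_image_name default 18)" "$PROJECT" && exit 1` before a build would refuse to overwrite an image whose version was not incremented. Here `$PROJECT` stands for whatever project ID the caller supplies.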