[docs] Document the purpose of, and how to update, the worker VM base image #15272
Open
cjllanwarne wants to merge 9 commits into hail-is:main from cjllanwarne:cjl_worker_image_doc
+108
−0
6d301d3 WIP (cjllanwarne)
e71aa1e feedback (cjllanwarne)
9364da4 separate first time from subsequent (cjllanwarne)
deca731 Clarify version incrementing steps for image creation (cjllanwarne)
8556737 Update worker VM image documentation steps (cjllanwarne)
4e4651e Revise worker VM image build instructions (cjllanwarne)
26950b5 Apply suggestions from code review (cjllanwarne)
bd1b65b Enhance version incrementing procedure in documentation (cjllanwarne)
a2919c4 Fix numbering in worker VM image documentation (cjllanwarne)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| # Batch Worker VM Image | ||
|
|
||
| ## Background | ||
|
|
||
| Batch worker VMs use a two-layer image system: | ||
|
|
||
| 1. **VM image** (this doc) -- A GCE VM image with the base OS, Docker, GPU drivers, and the root | ||
| Docker image pre-pulled. This is what each worker VM boots from. | ||
|
|
||
| 2. **Docker worker image** (`Dockerfile.worker`) -- The container that actually runs the batch | ||
| worker process. It's built separately via `make batch-worker-image` and pulled onto workers at | ||
| startup. | ||
|
|
||
| The VM image exists primarily so that worker VMs boot fast. Without it, every new worker would | ||
| need to install Docker, pull a ~2 GB root image, and install GPU drivers on every boot. Baking | ||
| all of that into a VM image lets new workers go from creation to running jobs in under a minute. | ||
|
|
||
| The VM image changes rarely (only when we need to update Docker, GPU drivers, or other | ||
| OS-level dependencies). The Docker worker image changes on every deploy. | ||
|
|
||
| ### How the build works | ||
|
|
||
| The build scripts create a temporary VM from a stock Ubuntu image, run a startup script that | ||
| installs everything, wait for the VM to shut itself down, then snapshot its disk into a reusable | ||
| image. The temporary VM is deleted afterward. | ||
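The flow described above can be sketched with plain `gcloud` commands. This is a minimal illustration, not the contents of `batch/gcp-create-worker-image.sh`: the function name `build_worker_image`, its argument order, the `POLL_SECONDS` variable, and the assumption that the stock image lives in the `ubuntu-os-cloud` project are all hypothetical.

```bash
# Hypothetical sketch of the build flow. The real script may use different
# flags, names, and error handling.
build_worker_image() {
    local builder="$1" image="$2" base_image="$3" startup="$4" project="$5" zone="$6"

    # 1. Boot a temporary VM from a stock Ubuntu image, with the provisioning
    #    script attached as its startup script.
    gcloud compute instances create "$builder" \
        --project "$project" --zone "$zone" \
        --image "$base_image" --image-project ubuntu-os-cloud \
        --metadata-from-file "startup-script=$startup" || return 1

    # 2. Poll until the VM has shut itself down.
    while [ "$(gcloud compute instances describe "$builder" \
                 --project "$project" --zone "$zone" \
                 --format 'value(status)')" != "TERMINATED" ]; do
        sleep "${POLL_SECONDS:-10}"
    done

    # 3. Capture the boot disk as a reusable image. Assumes the boot disk
    #    has the same name as the instance, which is the gcloud default.
    gcloud compute images create "$image" \
        --project "$project" \
        --source-disk "$builder" --source-disk-zone "$zone" || return 1

    # 4. Delete the temporary VM.
    gcloud compute instances delete "$builder" \
        --project "$project" --zone "$zone" --quiet
}
```

The function is only defined here, not invoked; calling it creates billable resources in the target project.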
|
|
||
| ## GCP | ||
|
|
||
| ### First time creation | ||
|
|
||
| If this is a brand new deployment, the worker image must be created before Batch can be deployed. | ||
| See `infra/gcp/README.md` — the image creation step comes after `deploy_unmanaged` and before | ||
| downloading global-config. | ||
|
|
||
| #### Prerequisites | ||
|
|
||
| - `gcloud` configured and authenticated with the target project | ||
| - `NAMESPACE` environment variable set (usually `default`) | ||
| - Access to the project's global-config (for project ID, zone, docker root image) | ||
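The first two prerequisites can be checked from a shell before starting. The sketch below is illustrative only: `check_prereqs` is a hypothetical helper that is not part of the repo, and it does not verify access to global-config.

```bash
# Hypothetical pre-flight check for the prerequisites listed above.
check_prereqs() {
    # gcloud must be installed and on PATH.
    if ! command -v gcloud >/dev/null 2>&1; then
        echo "gcloud not found on PATH" >&2
        return 1
    fi

    # NAMESPACE must be set (usually "default").
    if [ -z "${NAMESPACE:-}" ]; then
        echo "NAMESPACE is not set" >&2
        return 1
    fi

    # There must be an active, authenticated gcloud account.
    local account
    account="$(gcloud auth list --filter=status:ACTIVE --format='value(account)')"
    if [ -z "$account" ]; then
        echo "no active gcloud account; run 'gcloud auth login'" >&2
        return 1
    fi

    # Print what will be used so the target project can be double-checked.
    echo "OK: namespace=$NAMESPACE account=$account project=$(gcloud config get-value project 2>/dev/null)"
}
```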
|
|
||
| #### Building the image | ||
|
|
||
| From the repo root: | ||
|
|
||
| ```bash | ||
| NAMESPACE=default batch/gcp-create-worker-image.sh | ||
| ``` | ||
|
|
||
| The script will show a confirmation prompt with the image name, version, project, and zone before | ||
| proceeding. | ||
|
|
||
| #### What gets installed (startup script) | ||
|
|
||
| The GCP startup script (`build-batch-worker-image-startup-gcp.sh`) installs: | ||
|
|
||
| - Google logging agent and Cloud Ops agent | ||
| - Docker CE + `docker-credential-gcr` | ||
| - NVIDIA drivers (535.183.01) and `nvidia-container-toolkit` — baked into every worker image, but | ||
| only activated at runtime on VMs with GPUs attached (see `is_gpu()` check in `worker.py`) | ||
| - Build tools (gcc-12, g++-12) | ||
| - Pre-pulls the `docker_root_image` from Artifact Registry | ||
|
|
||
| The VM shuts itself down when the script completes. The build script polls for this, then snapshots | ||
| the disk. | ||
|
|
||
| #### Image naming | ||
|
|
||
| - `default` namespace: `batch-worker-{VERSION}` (e.g. `batch-worker-17`) | ||
| - Other namespaces: `batch-worker-{NAMESPACE}-{VERSION}` | ||
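The naming rule above can be expressed as a small shell function. `worker_image_name` is a hypothetical helper written for illustration; the build script may compute the name differently.

```bash
# Hypothetical helper mirroring the naming convention described above.
worker_image_name() {
    local namespace="$1" version="$2"
    if [ "$namespace" = "default" ]; then
        # The default namespace omits the namespace component.
        echo "batch-worker-${version}"
    else
        echo "batch-worker-${namespace}-${version}"
    fi
}
```

For example, `worker_image_name default 17` prints `batch-worker-17`.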
|
|
||
| ### Incrementing the version | ||
|
|
||
| #### At the same time | ||
|
|
||
| 1. Take this opportunity to identify a suitable new base image (`UBUNTU_IMAGE`; look in the image list in the GCP console for the latest). | ||
| 2. Also check whether newer NVIDIA GPU drivers are available. Out-of-date drivers that don't match the OS version have been seen to | ||
| cause the build to hang indefinitely. | ||
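As an alternative to browsing the console, the newest image in an Ubuntu family can be looked up from the command line. This is a sketch under assumptions: `latest_ubuntu_image` is a hypothetical helper, and the family name passed to it (for example `ubuntu-2204-lts`) is only an example that should be matched to whatever release the build script targets.

```bash
# Hypothetical helper: print the name of the newest public Ubuntu image
# in the given image family.
latest_ubuntu_image() {
    local family="$1"
    gcloud compute images list \
        --project ubuntu-os-cloud --no-standard-images \
        --filter "family=$family" \
        --sort-by "~creationTimestamp" --limit 1 \
        --format "value(name)"
}
```

Usage would look like `latest_ubuntu_image ubuntu-2204-lts`, with the result used as a candidate value for `UBUNTU_IMAGE`.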
|
|
||
| #### Procedure to Increment | ||
|
|
||
| 1. Increment `WORKER_IMAGE_VERSION` in `batch/gcp-create-worker-image.sh`. | ||
| - **Very important:** you MUST increment the version before running the script. The version is part of | ||
| the image name, so running without doing this would replace the current image relied on in prod. | ||
| 2. Run the build script with a custom `NAMESPACE` to make sure the image builds and deploys successfully. | ||
| - e.g. `NAMESPACE=YOURNAME batch/gcp-create-worker-image.sh` | ||
| 3. Monitor the build process: open the VM list in the Google Cloud console and find the build VM you just triggered. Under the three-dot menu, | ||
| open Monitoring and watch the logs. | ||
| 4. Update the hardcoded image reference in | ||
| `batch/batch/cloud/gcp/driver/create_instance.py` (search for `batch-worker-`) | ||
| 5. Do a test deploy and check that things look good. | ||
| 6. Repeat the build with `NAMESPACE=default`. | ||
| 7. Create a PR, watch CI, and deploy with the updated image. | ||
| 8. NOTE: The rollout only affects newly created workers. Preexisting workers remain on their | ||
| old image until they are manually deleted or gradually replaced through normal system operation. | ||
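Because forgetting to increment the version would replace the image used in prod, it can be worth confirming that the target image name is unused before running the build. The guard below is a hypothetical sketch (`image_exists` and `check_version_is_new` are not part of the repo); it relies on `gcloud compute images describe` exiting non-zero when the image is not found.

```bash
# Hypothetical guard: refuse to proceed if the target image name is taken.
image_exists() {
    local image="$1" project="$2"
    gcloud compute images describe "$image" --project "$project" >/dev/null 2>&1
}

check_version_is_new() {
    local image="$1" project="$2"
    if image_exists "$image" "$project"; then
        echo "Image $image already exists in $project; increment WORKER_IMAGE_VERSION first." >&2
        return 1
    fi
    echo "Image name $image is free."
}
```

For example, `check_version_is_new batch-worker-18 my-project` could be run before the build, where `my-project` stands in for the real project ID.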
|
|
||
| #### Rollback | ||
|
|
||
| Rollback is straightforward: revert the image version in `batch/batch/cloud/gcp/driver/create_instance.py` and | ||
| redeploy. Workers that already booted from the new image keep running it, so you'll need to delete all existing workers for the rollback to take effect. | ||
|
|
||
| ## Key files | ||
|
|
||
| | File | Purpose | | ||
| |------|---------| | ||
| | `batch/gcp-create-worker-image.sh` | GCP build orchestration script | | ||
| | `batch/build-batch-worker-image-startup-gcp.sh` | GCP VM startup/provisioning (Jinja2 template) | | ||
| | `batch/Dockerfile.worker` | Docker worker image (separate from the VM image) | | ||
| | `batch/batch/cloud/gcp/driver/create_instance.py` | Runtime: creates worker VMs using the GCP image | |