[batch] Upgrade pooled vms from n1 to n2 machines#15497
Conversation
|
🎉 |
| # Setup ops agent before anything else so startup failures are visible in Cloud Logging | ||
| touch /worker.log | ||
| touch /run.log |
There was a problem hiding this comment.
We should probably respect this by putting this above the SSD wrangling above
There was a problem hiding this comment.
Out of date comment... see the main one below
| # combine multiple local SSDs into a single RAID0 array | ||
| if [ "$NUM_LOCAL_SSDS" -gt 1 ]; then | ||
| DEVICES="" | ||
| for i in $(seq 1 $NUM_LOCAL_SSDS); do | ||
| DEVICES="$DEVICES /dev/nvme0n$i" | ||
| done | ||
| mdadm --create /dev/md0 --level=0 --raid-devices=$NUM_LOCAL_SSDS $DEVICES --force --run | ||
| fi |
There was a problem hiding this comment.
The intent here seems to be to add RAID formatting before the remaining stuff (format worker disk, configure docker, etc). But maybe during a rebase, this has ended up duplicating content that is now on line 299+
I suspect (given the ops agent stuff I noticed below...) that my changes to support ubuntu 24 came in, moved some stuff around, and the rebase didn't notice that it needed to be careful and just duplicated all this content here instead.
NB: There's also a regression in this copied version, because it doesn't include the switching of containerd to the local ssd that I had to add for ubuntu 24
| **n2_highcpu_machines, | ||
| 'n2-highmem-128': MachineTypeParts( | ||
| cores=128, | ||
| memory=gib_to_bytes(864), |
There was a problem hiding this comment.
Correct. Not sure why/how it ends up as 864 or why it breaks the 8x{cores} pattern, but that's what it is https://docs.cloud.google.com/compute/docs/general-purpose-machines#n2-high-mem
| gpu_config=None, | ||
| machine_family='n2', | ||
| worker_type='highmem', | ||
| ), |
There was a problem hiding this comment.
n2-highcpu tops out at 96GB https://docs.cloud.google.com/compute/docs/general-purpose-machines#n2-high-cpu
| 'standard': [1, 2, 4, 8, 16, 32, 64, 96], | ||
| 'highmem': [2, 4, 8, 16, 32, 64, 96], | ||
| 'standard': [2, 4, 8, 16, 32, 48, 64, 80, 96, 128], | ||
| 'highmem': [2, 4, 8, 16, 32, 48, 64, 80, 96], |
There was a problem hiding this comment.
You have a highmem 128 listed above, but not here?
| return 1 | ||
|
|
||
|
|
||
| def gcp_local_ssd_size(machine_family: str, cores: int) -> int: |
There was a problem hiding this comment.
This was updated to work for n2. Is it still valid when people ask for custom machine types that are not n2 (g4, n1, e5, etc?)
There was a problem hiding this comment.
Yup -- We previously hardcoded in 375 GB for one SSD, but now we're gonna explicitly handle >1 SSD count in N2s via the new gcp_local_ssd_count().
There was a problem hiding this comment.
And "375 GB" is now a constant (GCP_LOCAL_SSD_PARTITION_SIZE) rather than a hardcoded value https://github.com/grohli/hail/blob/d646fb3f663bfcfcdd0b2bb0292ff4fb20a32b4a/batch/batch/cloud/gcp/resource_utils.py#L346
| assert int(machine_parts.memory / machine_parts.cores / 1024**2) == 8192 | ||
|
|
||
|
|
||
| def test_gcp_local_ssd_count(): |
There was a problem hiding this comment.
Not a huge deal since they're so short anyway, but if you use pytest parameterized test here you'd get a pass/failure per case, and each case would run even if the preceding one failed. Also true for test_gcp_local_ssd_size below.
|
|
||
| if cloud == 'gcp': | ||
| return 'n1-standard-1' | ||
| return 'n2-standard-2' |
There was a problem hiding this comment.
I think this is for custom machine tests? In which case I think n1-standard-1 is still valid and still smaller than n2-standard-2?
| if [ "$NUM_LOCAL_SSDS" -gt 1 ]; then | ||
| DEVICES="" | ||
| for i in $(seq 1 $NUM_LOCAL_SSDS); do | ||
| DEVICES="$DEVICES /dev/nvme0n$i" | ||
| done | ||
| mdadm --create /dev/md0 --level=0 --raid-devices=$NUM_LOCAL_SSDS $DEVICES --force --run | ||
| fi |
There was a problem hiding this comment.
The RAID script constructs device paths incorrectly for N2 machines. The loop seq 1 $NUM_LOCAL_SSDS generates indices 1, 2, 3... which creates device names /dev/nvme0n1, /dev/nvme0n2, /dev/nvme0n3. However, when multiple local SSDs are attached to a GCP instance, they appear as /dev/nvme0n1, /dev/nvme0n2, etc. BUT the script is missing sudo for the mdadm command, and more critically, NVMe device indices in GCP start from a specific offset depending on the boot disk. The script should verify device existence before use.
if [ "$NUM_LOCAL_SSDS" -gt 1 ]; then
DEVICES=""
for i in $(seq 0 $(($NUM_LOCAL_SSDS - 1))); do
DEVICE="/dev/nvme0n$((i+1))"
# Wait for device to be ready
while [ ! -e "$DEVICE" ]; do sleep 1; done
DEVICES="$DEVICES $DEVICE"
done
sudo mdadm --create /dev/md0 --level=0 --raid-devices=$NUM_LOCAL_SSDS $DEVICES --force --run
fiThe missing sudo will cause the mdadm command to fail with permission denied.
Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
Switch GCP_MACHINE_FAMILY from 'n1' to 'n2' so all pool workers create N2 VMs. Add N2 memory ratios, machine type entries (standard/highmem/highcpu), and valid core counts. Update pricing pipeline to recognize N2 SKUs and handle N2 memory SKU descriptions. Existing N1 entries retained for GPU JPIM billing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…partitions and RAID0 when >1 N2 machines require local SSDs in specific quantities (e.g., n2-standard-16 requires minimum 2). The previous code always attached exactly 1, which GCP rejects. This adds a per-core-count lookup for the minimum SSD count, attaches that many SCRATCH disks in the VM config, and combines them via mdadm RAID0 in the startup script when count > 1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corrected the N2_MIN_LOCAL_SSD_COUNT_BY_CORES lookup table using the actual values from GCP documentation. Four entries were underestimated: 32-core (2->4), 48-core (4->8), 64-core (4->8), 96-core (8->16). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge duplicate imports from batch.cloud.gcp.resource_utils into a single statement to satisfy ruff's isort rules after merge with upstream main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split aliased import from regular imports per ruff 0.11.13's isort rules. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix create_instance.py merge conflict: remove duplicated disk/docker block, keep containerd version for ubuntu 24, move ops agent setup before SSD wrangling, position RAID0 assembly correctly - Revert smallest_machine_type to n1-standard-1 for custom machine tests - Parameterize test_gcp_local_ssd_count and test_gcp_local_ssd_size - Add 128 to highmem valid cores to match n2-highmem-128 in MACHINE_TYPE_TO_PARTS - Combine duplicate imports in test_utils.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
23a8ebb to
8a8771f
Compare
Change Description
This PR upgrades the GCP pooled worker VM machine family from N1 to N2. As an implicit part of this PR, N2-family machines are also now available for
job_privatejobs users may submit.Billing
SKU and pricing information for N2 resources are added to the resources database as part of this PR. Testing has shown that billing is accurate given resource usage of N2 pooled VMs.
Technical difference from N1 machines
Local SSDs: N2 machines require a minimum number of local SSD partitions that varies by vCPU count (e.g. 16 cores requires 2, 32 requires 4, up to 16 partitions for 96+ cores). N1 always used a single 375 GiB partition. The VM creation now attaches the correct number of partitions and combines them into a RAID0 array in the startup script when more than one is needed.
High-end Core:Mem ratio
n2-highmem-128has a non-standard memory ratio (864 GiB instead of 8 * 128 = 1024 GiB) and is defined as an explicit override.Validation:
SQL output showing the
resourcestable in a dev deploy properly populated with N2 pricing information:Test job using VM from pool on a deployment of this branch, showing that it's successfully used a pooled, preemptible n2 machine:


Raw SQL output corresponding to the above, in case you don't want to trust a bunch of "<$0.0001" from the UI:
Security Assessment
Delete all except the correct answer:
Impact Rating
Delete all except the correct answer:
Impact Description
This change does not give users any new permissions, etc. when using Hail, but it introduces a new VM type on which users can run their jobs. Additionally, this phases out the n1 family as our default pooled VM resource and fully replaces it with n2 in the pool.
Appsec Review