[batch] Upgrade pooled vms from n1 to n2 machines by grohli · Pull Request #15497 · hail-is/hail

grohli · 2026-05-22T19:20:24Z

Change Description

This PR upgrades the GCP pooled worker VM machine family from N1 to N2. As a side effect, N2-family machines are now also available for the job_private jobs that users submit.

Billing

SKU and pricing information for N2 resources is added to the resources database as part of this PR. Testing on a dev deployment showed that billing for N2 pooled VMs matches their resource usage.

Technical difference from N1 machines

Local SSDs: N2 machines require a minimum number of local SSD partitions that varies by vCPU count (e.g. 16 cores require 2, 32 require 4, up to 16 partitions for 96+ cores). N1 always used a single 375 GiB partition. VM creation now attaches the correct number of partitions, and the startup script combines them into a RAID0 array when more than one is needed.

High-end Core:Mem ratio: n2-highmem-128 has a non-standard memory ratio (864 GiB instead of 8 * 128 = 1024 GiB) and is defined as an explicit override.
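
A minimal sketch of how a ratio-plus-override scheme like this could look. `MachineTypeParts` and `gib_to_bytes` mirror names that appear in this PR's diff, but the definitions below are simplified stand-ins, `n2_machine` is a hypothetical helper, and the per-core ratios other than highmem's 8 GiB are assumptions taken from GCP's N2 documentation rather than from this PR.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def gib_to_bytes(gib: int) -> int:
    return gib * 1024**3


@dataclass(frozen=True)
class MachineTypeParts:
    # Simplified stand-in for the class of the same name in the diff.
    cores: int
    memory: int  # bytes
    gpu_config: Optional[str]
    machine_family: str
    worker_type: str


# GiB of memory per core. Only the highmem value (8) is stated in this PR;
# the standard and highcpu values are assumptions.
N2_GIB_PER_CORE = {'standard': 4, 'highmem': 8, 'highcpu': 1}


def n2_machine(worker_type: str, cores: int, memory_gib: Optional[int] = None) -> MachineTypeParts:
    # Hypothetical helper: derive memory from the ratio unless overridden.
    if memory_gib is None:
        memory_gib = N2_GIB_PER_CORE[worker_type] * cores
    return MachineTypeParts(cores, gib_to_bytes(memory_gib), None, 'n2', worker_type)


# Entries that follow the 8 GiB per core pattern.
machine_type_to_parts: Dict[str, MachineTypeParts] = {
    f'n2-highmem-{cores}': n2_machine('highmem', cores)
    for cores in (2, 4, 8, 16, 32, 48, 64, 80, 96)
}
# Explicit override: 864 GiB instead of 8 * 128 = 1024 GiB.
machine_type_to_parts['n2-highmem-128'] = n2_machine('highmem', 128, memory_gib=864)
```

Keeping the exception as a single explicit entry means any check that derives memory from the ratio has to skip or special-case `n2-highmem-128`.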

Validation:

SQL output showing the resources table in a dev deploy properly populated with N2 pricing information:

mysql> select * from resources where resource like '%n2%';
+-----------------------------------------------------+---------------------------------+-------------+---------------------+
| resource                                            | rate                            | resource_id | deduped_resource_id |
+-----------------------------------------------------+---------------------------------+-------------+---------------------+
| compute/n2-nonpreemptible/us-central1/1779433200000 |   0.000000000008780833333333336 |         152 |                 152 |
| compute/n2-nonpreemptible/us-east1/1779433200000    |   0.000000000008780833333333336 |         153 |                 153 |
| compute/n2-nonpreemptible/us-east4/1779433200000    |   0.000000000009890277777777778 |          85 |                  85 |
| compute/n2-nonpreemptible/us-west1/1779433200000    |   0.000000000008780833333333336 |         154 |                 154 |
| compute/n2-nonpreemptible/us-west2/1779433200000    |   0.000000000010547222222222221 |         134 |                 134 |
| compute/n2-nonpreemptible/us-west3/1779433200000    |   0.000000000010547222222222221 |         105 |                 105 |
| compute/n2-nonpreemptible/us-west4/1779433200000    |   0.000000000009889722222222222 |          84 |                  84 |
| compute/n2-preemptible/us-central1/1779433200000    |   0.000000000002841666666666667 |          42 |                  42 |
| compute/n2-preemptible/us-east1/1779433200000       |   0.000000000002841666666666667 |          43 |                  43 |
| compute/n2-preemptible/us-east4/1779433200000       |   0.000000000002061111111111111 |         118 |                 118 |
| compute/n2-preemptible/us-west1/1779433200000       |   0.000000000002841666666666667 |          44 |                  44 |
| compute/n2-preemptible/us-west2/1779433200000       |  0.0000000000029750000000000006 |         162 |                 162 |
| compute/n2-preemptible/us-west3/1779433200000       |   0.000000000003769444444444445 |          58 |                  58 |
| compute/n2-preemptible/us-west4/1779433200000       |   0.000000000004400000000000001 |         146 |                 146 |
| memory/n2-nonpreemptible/us-central1/1779433200000  |  0.0000000000011493598090277779 |          93 |                  93 |
| memory/n2-nonpreemptible/us-east1/1779433200000     |  0.0000000000011493598090277779 |          94 |                  94 |
| memory/n2-nonpreemptible/us-east4/1779433200000     |  0.0000000000012942165798611112 |         142 |                 142 |
| memory/n2-nonpreemptible/us-west1/1779433200000     |  0.0000000000011493598090277779 |          95 |                  95 |
| memory/n2-nonpreemptible/us-west2/1779433200000     |  0.0000000000013804796006944447 |          45 |                  45 |
| memory/n2-nonpreemptible/us-west3/1779433200000     |  0.0000000000013804796006944447 |         164 |                 164 |
| memory/n2-nonpreemptible/us-west4/1779433200000     |  0.0000000000012942165798611112 |         201 |                 201 |
| memory/n2-preemptible/us-central1/1779433200000     |  0.0000000000003716362847222223 |          98 |                  98 |
| memory/n2-preemptible/us-east1/1779433200000        |  0.0000000000003716362847222223 |          99 |                  99 |
| memory/n2-preemptible/us-east4/1779433200000        | 0.00000000000027045355902777777 |         136 |                 136 |
| memory/n2-preemptible/us-west1/1779433200000        |  0.0000000000003716362847222223 |         100 |                 100 |
| memory/n2-preemptible/us-west2/1779433200000        |  0.0000000000003911675347222222 |         140 |                 140 |
| memory/n2-preemptible/us-west3/1779433200000        |  0.0000000000004939778645833333 |          97 |                  97 |
| memory/n2-preemptible/us-west4/1779433200000        |  0.0000000000005750868055555555 |         128 |                 128 |
+-----------------------------------------------------+---------------------------------+-------------+---------------------+
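
As a rough sanity check on these rates, the sketch below converts a few of them to per-hour prices. It assumes (this is inferred from the magnitudes, not stated in the PR) that compute rates are in USD per milli-CPU per millisecond and memory rates in USD per MiB per millisecond. The function names are hypothetical.

```python
MSEC_PER_HOUR = 60 * 60 * 1000


def compute_rate_to_usd_per_vcpu_hour(rate: float) -> float:
    # Assumed unit: USD per mCPU-millisecond (1 vCPU = 1000 mCPU).
    return rate * 1000 * MSEC_PER_HOUR


def memory_rate_to_usd_per_gib_hour(rate: float) -> float:
    # Assumed unit: USD per MiB-millisecond (1 GiB = 1024 MiB).
    return rate * 1024 * MSEC_PER_HOUR


# Rates copied from the us-central1 rows of the table above.
nonpreemptible_compute = compute_rate_to_usd_per_vcpu_hour(0.000000000008780833333333336)
nonpreemptible_memory = memory_rate_to_usd_per_gib_hour(0.0000000000011493598090277779)
preemptible_compute = compute_rate_to_usd_per_vcpu_hour(0.000000000002841666666666667)
preemptible_memory = memory_rate_to_usd_per_gib_hour(0.0000000000003716362847222223)

print(f'non-preemptible: {nonpreemptible_compute:.6f} USD/vCPU-hour, {nonpreemptible_memory:.6f} USD/GiB-hour')
print(f'preemptible:     {preemptible_compute:.6f} USD/vCPU-hour, {preemptible_memory:.6f} USD/GiB-hour')
```

Under those unit assumptions the us-central1 rows work out to roughly 0.031611 USD per vCPU-hour and 0.004237 USD per GiB-hour for non-preemptible, and 0.010230 and 0.001370 for preemptible, which can then be compared against the provider's published hourly prices.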

Test job using a VM from the pool on a deployment of this branch, showing that it successfully used a pooled, preemptible N2 machine:

Raw SQL output corresponding to the above, in case you don't want to trust a bunch of "<$0.0001" from the UI:

mysql> select r.* from resources r inner join attempt_resources a on r.deduped_resource_id = a.deduped_resource_id;
+------------------------------------------------------+----------------------------------+-------------+---------------------+
| resource                                             | rate                             | resource_id | deduped_resource_id |
+------------------------------------------------------+----------------------------------+-------------+---------------------+
| service-fee/1                                        |    0.000000000002777777777777778 |           9 |                   9 |
| gcp-support-logs-specs-and-firewall-fees/1           |    0.000000000001388888888888889 |          10 |                  10 |
| ip-fee/preemptible/1024/1                            |   0.0000000000006781684027777778 |          11 |                  11 |
| compute/n2-preemptible/us-central1/1779433200000     |    0.000000000002841666666666667 |          42 |                  42 |
| memory/n2-preemptible/us-central1/1779433200000      |   0.0000000000003716362847222223 |          98 |                  98 |
| disk/pd-ssd/us-central1/1779433200000                |  0.00000000000006312861244201081 |         147 |                 147 |
| disk/local-ssd/preemptible/us-central1/1779433200000 | 0.000000000000010843267548863032 |         184 |                 184 |
+------------------------------------------------------+----------------------------------+-------------+---------------------+

Security Assessment

Delete all except the correct answer:

This change potentially impacts the Hail Batch instance as deployed by Broad Institute in GCP
- The Impact Rating, Impact Description, and Appsec Review sections are required

Impact Rating

Delete all except the correct answer:

This change has a medium security impact

Impact Description

This change does not give users any new permissions, etc. when using Hail, but it introduces a new VM type on which users can run their jobs. Additionally, this phases out the n1 family as our default pooled VM resource and fully replaces it with n2 in the pool.

Appsec Review

Required: The impact has been assessed and approved by appsec

ehigham · 2026-05-26T19:35:42Z

🎉

cjllanwarne · 2026-05-27T19:20:58Z

 # Setup ops agent before anything else so startup failures are visible in Cloud Logging
 touch /worker.log
 touch /run.log


We should probably respect this comment by moving this block above the SSD wrangling.

Out of date comment... see the main one below

cjllanwarne · 2026-05-27T19:29:42Z

+# combine multiple local SSDs into a single RAID0 array
+if [ "$NUM_LOCAL_SSDS" -gt 1 ]; then
+    DEVICES=""
+    for i in $(seq 1 $NUM_LOCAL_SSDS); do
+        DEVICES="$DEVICES /dev/nvme0n$i"
+    done
+    mdadm --create /dev/md0 --level=0 --raid-devices=$NUM_LOCAL_SSDS $DEVICES --force --run
+fi


The intent here seems to be to add RAID formatting before the remaining stuff (format worker disk, configure docker, etc). But maybe during a rebase, this has ended up duplicating content that is now on line 299+

I suspect (given the ops agent stuff I noticed below...) that my changes to support ubuntu 24 came in, moved some stuff around, and the rebase didn't notice that it needed to be careful and just duplicated all this content here instead.

NB: There's also a regression in this copied version, because it doesn't include the switching of containerd to the local ssd that I had to add for ubuntu 24

cjllanwarne · 2026-05-27T19:35:58Z

+    **n2_highcpu_machines,
+    'n2-highmem-128': MachineTypeParts(
+        cores=128,
+        memory=gib_to_bytes(864),


Correct. Not sure why/how it ends up as 864 or why it breaks the 8x{cores} pattern, but that's what it is https://docs.cloud.google.com/compute/docs/general-purpose-machines#n2-high-mem

cjllanwarne · 2026-05-27T19:36:13Z

+        gpu_config=None,
+        machine_family='n2',
+        worker_type='highmem',
+    ),


No n2-highcpu-128?

n2-highcpu tops out at 96 vCPUs (n2-highcpu-96, with 96 GB of memory) https://docs.cloud.google.com/compute/docs/general-purpose-machines#n2-high-cpu

cjllanwarne · 2026-05-27T19:36:38Z

-    'standard': [1, 2, 4, 8, 16, 32, 64, 96],
-    'highmem': [2, 4, 8, 16, 32, 64, 96],
+    'standard': [2, 4, 8, 16, 32, 48, 64, 80, 96, 128],
+    'highmem': [2, 4, 8, 16, 32, 48, 64, 80, 96],


You have a highmem 128 listed above, but not here?

cjllanwarne · 2026-05-27T19:38:11Z

+    return 1
+
+
+def gcp_local_ssd_size(machine_family: str, cores: int) -> int:


This was updated to work for n2. Is it still valid when people ask for custom machine types that are not n2 (g4, n1, e5, etc?)

Yup -- We previously hardcoded in 375 GB for one SSD, but now we're gonna explicitly handle >1 SSD count in N2s via the new gcp_local_ssd_count().

And "375 GB" is now a constant (GCP_LOCAL_SSD_PARTITION_SIZE) rather than a hardcoded value https://github.com/grohli/hail/blob/d646fb3f663bfcfcdd0b2bb0292ff4fb20a32b4a/batch/batch/cloud/gcp/resource_utils.py#L346
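
A minimal sketch of how the count and size helpers might fit together, so that non-N2 families keep the old single-partition behaviour. The names `gcp_local_ssd_count`, `gcp_local_ssd_size`, `GCP_LOCAL_SSD_PARTITION_SIZE` and `N2_MIN_LOCAL_SSD_COUNT_BY_CORES` come from this PR, but the bodies, the signature of `gcp_local_ssd_count`, the unit of the constant, and the table entries marked "assumed" are guesses rather than the actual implementation.

```python
GCP_LOCAL_SSD_PARTITION_SIZE = 375  # size of one partition; unit assumed to be GiB

# Minimum local SSD partitions for N2, keyed by core count. The 16, 32, 48,
# 64 and 96-core values are quoted in this PR; the others are assumptions.
N2_MIN_LOCAL_SSD_COUNT_BY_CORES = {
    2: 1,  # assumed
    4: 1,  # assumed
    8: 1,  # assumed
    16: 2,
    32: 4,
    48: 8,
    64: 8,
    80: 8,  # assumed
    96: 16,
    128: 16,  # "96+ cores" in the PR description
}


def gcp_local_ssd_count(machine_family: str, cores: int) -> int:
    if machine_family == 'n2':
        return N2_MIN_LOCAL_SSD_COUNT_BY_CORES[cores]
    # Assumption: every other family keeps the previous single partition.
    return 1


def gcp_local_ssd_size(machine_family: str, cores: int) -> int:
    # Total local SSD capacity is the partition count times the partition size.
    return gcp_local_ssd_count(machine_family, cores) * GCP_LOCAL_SSD_PARTITION_SIZE
```

With this shape, `gcp_local_ssd_size('n1', 16)` still yields a single partition's worth of space, while `gcp_local_ssd_size('n2', 16)` yields two.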

cjllanwarne · 2026-05-27T19:39:29Z

            assert int(machine_parts.memory / machine_parts.cores / 1024**2) == 8192


+def test_gcp_local_ssd_count():


Not a huge deal since they're so short anyway, but if you use pytest parameterized test here you'd get a pass/failure per case, and each case would run even if the preceding one failed. Also true for test_gcp_local_ssd_size below.
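
For illustration, a parameterized version could look roughly like the sketch below. The `gcp_local_ssd_count` defined here is only a stand-in so the example is self-contained (its signature is an assumption), and the cases use the SSD counts quoted elsewhere in this PR.

```python
import pytest


def gcp_local_ssd_count(machine_family: str, cores: int) -> int:
    # Stand-in for the real function under test; signature is assumed.
    n2_counts = {16: 2, 32: 4, 48: 8, 64: 8, 96: 16}
    return n2_counts[cores] if machine_family == 'n2' else 1


@pytest.mark.parametrize(
    'machine_family, cores, expected',
    [
        ('n1', 16, 1),
        ('n2', 16, 2),
        ('n2', 32, 4),
        ('n2', 96, 16),
    ],
)
def test_gcp_local_ssd_count(machine_family: str, cores: int, expected: int):
    # Each tuple above is collected and reported as its own test case.
    assert gcp_local_ssd_count(machine_family, cores) == expected
```

Each tuple shows up as a separate test ID in the pytest report, so one failing core count does not hide the results for the others.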

cjllanwarne · 2026-05-27T19:40:10Z


    if cloud == 'gcp':
-        return 'n1-standard-1'
+        return 'n2-standard-2'


I think this is for custom machine tests? In which case I think n1-standard-1 is still valid and still smaller than n2-standard-2?

graphite-app · 2026-05-28T19:15:18Z

+if [ "$NUM_LOCAL_SSDS" -gt 1 ]; then
+    DEVICES=""
+    for i in $(seq 1 $NUM_LOCAL_SSDS); do
+        DEVICES="$DEVICES /dev/nvme0n$i"
+    done
+    mdadm --create /dev/md0 --level=0 --raid-devices=$NUM_LOCAL_SSDS $DEVICES --force --run
+fi


The RAID script constructs device paths incorrectly for N2 machines. The loop seq 1 $NUM_LOCAL_SSDS generates indices 1, 2, 3... which creates device names /dev/nvme0n1, /dev/nvme0n2, /dev/nvme0n3. However, when multiple local SSDs are attached to a GCP instance, they appear as /dev/nvme0n1, /dev/nvme0n2, etc. BUT the script is missing sudo for the mdadm command, and more critically, NVMe device indices in GCP start from a specific offset depending on the boot disk. The script should verify device existence before use.

if [ "$NUM_LOCAL_SSDS" -gt 1 ]; then
    DEVICES=""
    for i in $(seq 0 $(($NUM_LOCAL_SSDS - 1))); do
        DEVICE="/dev/nvme0n$((i+1))"
        # Wait for device to be ready
        while [ ! -e "$DEVICE" ]; do sleep 1; done
        DEVICES="$DEVICES $DEVICE"
    done
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=$NUM_LOCAL_SSDS $DEVICES --force --run
fi

The missing sudo will cause the mdadm command to fail with permission denied.

Spotted by Graphite


Switch GCP_MACHINE_FAMILY from 'n1' to 'n2' so all pool workers create N2 VMs. Add N2 memory ratios, machine type entries (standard/highmem/highcpu), and valid core counts. Update pricing pipeline to recognize N2 SKUs and handle N2 memory SKU descriptions. Existing N1 entries retained for GPU JPIM billing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…partitions and RAID0 when >1

N2 machines require local SSDs in specific quantities (e.g., n2-standard-16 requires a minimum of 2). The previous code always attached exactly 1, which GCP rejects. This adds a per-core-count lookup for the minimum SSD count, attaches that many SCRATCH disks in the VM config, and combines them via mdadm RAID0 in the startup script when count > 1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
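
A sketch of what "attaches that many SCRATCH disks in the VM config" might look like. The field names follow the Compute Engine instance-creation request body as I understand it. The helper name `local_ssd_disk_configs` is hypothetical, and this is not necessarily how create_instance.py structures its config.

```python
from typing import Any, Dict, List


def local_ssd_disk_configs(zone: str, num_local_ssds: int) -> List[Dict[str, Any]]:
    """Build one SCRATCH disk entry per local SSD partition."""
    return [
        {
            'type': 'SCRATCH',
            'autoDelete': True,
            # NVMe is assumed here, matching the /dev/nvme0n* device names
            # used by the startup script.
            'interface': 'NVME',
            'initializeParams': {'diskType': f'zones/{zone}/diskTypes/local-ssd'},
        }
        for _ in range(num_local_ssds)
    ]


# Example: a 16-core N2 worker needs at least two partitions.
disks = local_ssd_disk_configs('us-central1-a', 2)
```

The resulting list would be appended to the boot disk entry in the instance's `disks` field, and the startup script would then see one NVMe device per entry.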

Corrected the N2_MIN_LOCAL_SSD_COUNT_BY_CORES lookup table using the actual values from GCP documentation. Four entries were underestimated: 32-core (2->4), 48-core (4->8), 64-core (4->8), 96-core (8->16). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge duplicate imports from batch.cloud.gcp.resource_utils into a single statement to satisfy ruff's isort rules after merge with upstream main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Split aliased import from regular imports per ruff 0.11.13's isort rules. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix create_instance.py merge conflict: remove duplicated disk/docker block, keep containerd version for ubuntu 24, move ops agent setup before SSD wrangling, position RAID0 assembly correctly
- Revert smallest_machine_type to n1-standard-1 for custom machine tests
- Parameterize test_gcp_local_ssd_count and test_gcp_local_ssd_size
- Add 128 to highmem valid cores to match n2-highmem-128 in MACHINE_TYPE_TO_PARTS
- Combine duplicate imports in test_utils.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

grohli requested a review from a team as a code owner May 22, 2026 19:20

grohli requested review from cjllanwarne and kush-chandra May 22, 2026 19:20

grohli changed the title ~~Upgrade hail pooled vms from n1 to n2 machines~~ [batch] Upgrade pooled vms from n1 to n2 machines May 26, 2026

graphite-app Bot reviewed May 26, 2026

Comment thread batch/test/test_utils.py Outdated

cjllanwarne requested changes May 27, 2026

graphite-app Bot reviewed May 28, 2026

grohli and others added 7 commits June 1, 2026 11:32

[batch] Fix ruff I001 import sorting in test_utils.py

fefae77

Merge duplicate imports from batch.cloud.gcp.resource_utils into a single statement to satisfy ruff's isort rules after merge with upstream main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[batch] Fix ruff I001 import sorting for ruff 0.11.13

64867f7

Split aliased import from regular imports per ruff 0.11.13's isort rules. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[batch] Fix ruff format long assert line in resource_manager.py

54f2dcb

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

grohli force-pushed the upgrade-hail-pooled-vms-from-n1-to-n2-machines branch from 23a8ebb to 8a8771f Compare June 1, 2026 15:38

3 participants
