[batch] Upgrade pooled vms from n1 to n2 machines #15497

Open
grohli wants to merge 7 commits into hail-is:main from grohli:upgrade-hail-pooled-vms-from-n1-to-n2-machines

Contributor

grohli commented May 22, 2026

Change Description

This PR upgrades the GCP pooled worker VM machine family from N1 to N2. As a side effect, N2-family machines are now also available for the job_private jobs that users submit.

Billing

SKU and pricing information for N2 resources is added to the resources database as part of this PR. Testing has shown that billing is accurate given the resource usage of N2 pooled VMs.

Technical differences from N1 machines

Local SSDs: N2 machines require a minimum number of local SSD partitions that varies by vCPU count (e.g. 16 cores require 2, 32 require 4, up to 16 partitions for 96+ cores). N1 always used a single 375 GiB partition. VM creation now attaches the correct number of partitions, and the startup script combines them into a RAID0 array when more than one is needed.
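
A minimal sketch of the per-core-count lookup described above, not the PR's actual code. The five table entries are the ones quoted in this PR (16 → 2, 32 → 4, 48 → 8, 64 → 8, 96 → 16). Treating smaller machines as needing one partition, reusing the nearest lower entry for unlisted core counts, and giving every non-N2 family a single partition are all assumptions, as are the names `MIN_SSD_COUNT_BY_CORES`, `local_ssd_count`, and `local_ssd_size_gib`.

```python
from bisect import bisect_right

# Partition size quoted in this PR for a single local SSD.
LOCAL_SSD_PARTITION_GIB = 375

# Minimum partition counts quoted in this PR for N2 machines.
MIN_SSD_COUNT_BY_CORES = {16: 2, 32: 4, 48: 8, 64: 8, 96: 16}
_THRESHOLDS = sorted(MIN_SSD_COUNT_BY_CORES)


def local_ssd_count(machine_family: str, cores: int) -> int:
    """Return the number of local SSD partitions to attach."""
    if machine_family != 'n2':
        # N1 always used a single partition; assuming the same for other families.
        return 1
    idx = bisect_right(_THRESHOLDS, cores)
    if idx == 0:
        # Assumption: machines below the smallest listed size need only one.
        return 1
    # Assumption: unlisted core counts reuse the nearest lower entry.
    return MIN_SSD_COUNT_BY_CORES[_THRESHOLDS[idx - 1]]


def local_ssd_size_gib(machine_family: str, cores: int) -> int:
    """Total local SSD capacity once all partitions are combined."""
    return local_ssd_count(machine_family, cores) * LOCAL_SSD_PARTITION_GIB
```

With these definitions a 32-core N2 worker gets 4 partitions, i.e. 4 * 375 GiB of combined scratch space, while an N1 worker keeps its single partition.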

High-end Core:Mem ratio: n2-highmem-128 has a non-standard memory ratio (864 GiB instead of 8 * 128 = 1024 GiB), so it is defined as an explicit override.
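
One way to express such an override, as a sketch rather than the code in this PR. The 8 GiB-per-core highmem ratio and the 864 GiB exception come from the text above; the standard and highcpu ratios are assumptions based on GCP's published N2 shapes, and all names here are hypothetical.

```python
GIB = 1024**3

# GiB of memory per core. The highmem value is stated in this PR;
# the other two ratios are assumptions.
N2_GIB_PER_CORE = {'standard': 4, 'highmem': 8, 'highcpu': 1}

# Shapes whose memory does not follow the per-core ratio.
N2_MEMORY_OVERRIDES_GIB = {('highmem', 128): 864}


def n2_memory_bytes(worker_type: str, cores: int) -> int:
    """Memory for an N2 shape, honouring explicit overrides first."""
    override = N2_MEMORY_OVERRIDES_GIB.get((worker_type, cores))
    if override is not None:
        return override * GIB
    return N2_GIB_PER_CORE[worker_type] * cores * GIB
```

Keeping the exception in a separate table leaves the ratio-based rule intact for every other shape, so a test that checks the per-core ratio only needs to skip the overridden entries.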

Validation:

SQL output showing the resources table in a dev deploy properly populated with N2 pricing information:

mysql> select * from resources where resource like '%n2%';
+-----------------------------------------------------+---------------------------------+-------------+---------------------+
| resource                                            | rate                            | resource_id | deduped_resource_id |
+-----------------------------------------------------+---------------------------------+-------------+---------------------+
| compute/n2-nonpreemptible/us-central1/1779433200000 |   0.000000000008780833333333336 |         152 |                 152 |
| compute/n2-nonpreemptible/us-east1/1779433200000    |   0.000000000008780833333333336 |         153 |                 153 |
| compute/n2-nonpreemptible/us-east4/1779433200000    |   0.000000000009890277777777778 |          85 |                  85 |
| compute/n2-nonpreemptible/us-west1/1779433200000    |   0.000000000008780833333333336 |         154 |                 154 |
| compute/n2-nonpreemptible/us-west2/1779433200000    |   0.000000000010547222222222221 |         134 |                 134 |
| compute/n2-nonpreemptible/us-west3/1779433200000    |   0.000000000010547222222222221 |         105 |                 105 |
| compute/n2-nonpreemptible/us-west4/1779433200000    |   0.000000000009889722222222222 |          84 |                  84 |
| compute/n2-preemptible/us-central1/1779433200000    |   0.000000000002841666666666667 |          42 |                  42 |
| compute/n2-preemptible/us-east1/1779433200000       |   0.000000000002841666666666667 |          43 |                  43 |
| compute/n2-preemptible/us-east4/1779433200000       |   0.000000000002061111111111111 |         118 |                 118 |
| compute/n2-preemptible/us-west1/1779433200000       |   0.000000000002841666666666667 |          44 |                  44 |
| compute/n2-preemptible/us-west2/1779433200000       |  0.0000000000029750000000000006 |         162 |                 162 |
| compute/n2-preemptible/us-west3/1779433200000       |   0.000000000003769444444444445 |          58 |                  58 |
| compute/n2-preemptible/us-west4/1779433200000       |   0.000000000004400000000000001 |         146 |                 146 |
| memory/n2-nonpreemptible/us-central1/1779433200000  |  0.0000000000011493598090277779 |          93 |                  93 |
| memory/n2-nonpreemptible/us-east1/1779433200000     |  0.0000000000011493598090277779 |          94 |                  94 |
| memory/n2-nonpreemptible/us-east4/1779433200000     |  0.0000000000012942165798611112 |         142 |                 142 |
| memory/n2-nonpreemptible/us-west1/1779433200000     |  0.0000000000011493598090277779 |          95 |                  95 |
| memory/n2-nonpreemptible/us-west2/1779433200000     |  0.0000000000013804796006944447 |          45 |                  45 |
| memory/n2-nonpreemptible/us-west3/1779433200000     |  0.0000000000013804796006944447 |         164 |                 164 |
| memory/n2-nonpreemptible/us-west4/1779433200000     |  0.0000000000012942165798611112 |         201 |                 201 |
| memory/n2-preemptible/us-central1/1779433200000     |  0.0000000000003716362847222223 |          98 |                  98 |
| memory/n2-preemptible/us-east1/1779433200000        |  0.0000000000003716362847222223 |          99 |                  99 |
| memory/n2-preemptible/us-east4/1779433200000        | 0.00000000000027045355902777777 |         136 |                 136 |
| memory/n2-preemptible/us-west1/1779433200000        |  0.0000000000003716362847222223 |         100 |                 100 |
| memory/n2-preemptible/us-west2/1779433200000        |  0.0000000000003911675347222222 |         140 |                 140 |
| memory/n2-preemptible/us-west3/1779433200000        |  0.0000000000004939778645833333 |          97 |                  97 |
| memory/n2-preemptible/us-west4/1779433200000        |  0.0000000000005750868055555555 |         128 |                 128 |
+-----------------------------------------------------+---------------------------------+-------------+---------------------+
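
For a sense of scale, the rates above can be converted to hourly prices. This is a sketch under an assumption the table itself does not state: that compute rates are in USD per mCPU per millisecond and memory rates are in USD per MiB per millisecond. The function names are illustrative.

```python
MS_PER_HOUR = 3_600_000


def compute_usd_per_core_hour(rate: float) -> float:
    # Assumes the rate is USD per mCPU-millisecond (1000 mCPU per core).
    return rate * 1000 * MS_PER_HOUR


def memory_usd_per_gib_hour(rate: float) -> float:
    # Assumes the rate is USD per MiB-millisecond (1024 MiB per GiB).
    return rate * 1024 * MS_PER_HOUR


# Rates copied from the us-central1 nonpreemptible rows above.
print(compute_usd_per_core_hour(0.000000000008780833333333336))
print(memory_usd_per_gib_hour(0.0000000000011493598090277779))
```

Under that unit assumption the two rows work out to roughly $0.0316 per core-hour and $0.0042 per GiB-hour.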

Test job on a deployment of this branch, showing that it successfully ran on a pooled, preemptible n2 machine:
image
image
Raw SQL output corresponding to the above, in case you don't want to trust a bunch of "<$0.0001" from the UI:

mysql> select r.* from resources r inner join attempt_resources a on r.deduped_resource_id = a.deduped_resource_id;
+------------------------------------------------------+----------------------------------+-------------+---------------------+
| resource                                             | rate                             | resource_id | deduped_resource_id |
+------------------------------------------------------+----------------------------------+-------------+---------------------+
| service-fee/1                                        |    0.000000000002777777777777778 |           9 |                   9 |
| gcp-support-logs-specs-and-firewall-fees/1           |    0.000000000001388888888888889 |          10 |                  10 |
| ip-fee/preemptible/1024/1                            |   0.0000000000006781684027777778 |          11 |                  11 |
| compute/n2-preemptible/us-central1/1779433200000     |    0.000000000002841666666666667 |          42 |                  42 |
| memory/n2-preemptible/us-central1/1779433200000      |   0.0000000000003716362847222223 |          98 |                  98 |
| disk/pd-ssd/us-central1/1779433200000                |  0.00000000000006312861244201081 |         147 |                 147 |
| disk/local-ssd/preemptible/us-central1/1779433200000 | 0.000000000000010843267548863032 |         184 |                 184 |
+------------------------------------------------------+----------------------------------+-------------+---------------------+

Security Assessment

Delete all except the correct answer:

  • This change potentially impacts the Hail Batch instance as deployed by Broad Institute in GCP
    • The Impact Rating, Impact Description, and Appsec Review sections are required

Impact Rating

Delete all except the correct answer:

  • This change has a medium security impact

Impact Description

This change does not give users any new permissions when using Hail, but it introduces a new VM type on which users can run their jobs. It also phases out the n1 family as our default pooled VM resource and fully replaces it with n2 in the pool.

Appsec Review

  • Required: The impact has been assessed and approved by appsec

grohli requested a review from a team as a code owner May 22, 2026 19:20
grohli changed the title from "Upgrade hail pooled vms from n1 to n2 machines" to "[batch] Upgrade pooled vms from n1 to n2 machines" May 26, 2026
Comment thread batch/test/test_utils.py Outdated
Member

ehigham commented May 26, 2026

🎉

Comment on lines 246 to 248
# Setup ops agent before anything else so startup failures are visible in Cloud Logging
touch /worker.log
touch /run.log
Collaborator

We should probably respect this comment by moving the ops agent setup above the SSD wrangling.

Collaborator

Out of date comment... see the main one below

Comment on lines +211 to +218
# combine multiple local SSDs into a single RAID0 array
if [ "$NUM_LOCAL_SSDS" -gt 1 ]; then
    DEVICES=""
    for i in $(seq 1 $NUM_LOCAL_SSDS); do
        DEVICES="$DEVICES /dev/nvme0n$i"
    done
    mdadm --create /dev/md0 --level=0 --raid-devices=$NUM_LOCAL_SSDS $DEVICES --force --run
fi
Collaborator

The intent here seems to be to add RAID formatting before the remaining stuff (format worker disk, configure docker, etc). But maybe during a rebase, this has ended up duplicating content that is now on line 299+

I suspect (given the ops agent stuff I noticed below...) that my changes to support ubuntu 24 came in, moved some stuff around, and the rebase didn't notice that it needed to be careful and just duplicated all this content here instead.

NB: There's also a regression in this copied version, because it doesn't include the switching of containerd to the local ssd that I had to add for ubuntu 24

    **n2_highcpu_machines,
    'n2-highmem-128': MachineTypeParts(
        cores=128,
        memory=gib_to_bytes(864),
Collaborator

Not 1024?

Contributor Author

Correct. Not sure why/how it ends up as 864 or why it breaks the 8x{cores} pattern, but that's what the docs list: https://docs.cloud.google.com/compute/docs/general-purpose-machines#n2-high-mem

        gpu_config=None,
        machine_family='n2',
        worker_type='highmem',
    ),
Collaborator

No n2-highcpu-128?

Contributor Author

Comment thread batch/batch/cloud/gcp/resource_utils.py Outdated
-    'standard': [1, 2, 4, 8, 16, 32, 64, 96],
-    'highmem': [2, 4, 8, 16, 32, 64, 96],
+    'standard': [2, 4, 8, 16, 32, 48, 64, 80, 96, 128],
+    'highmem': [2, 4, 8, 16, 32, 48, 64, 80, 96],
Collaborator

You have a highmem 128 listed above, but not here?

return 1


def gcp_local_ssd_size(machine_family: str, cores: int) -> int:
Collaborator

This was updated to work for n2. Is it still valid when people ask for custom machine types that are not n2 (g4, n1, e5, etc?)

Contributor Author

Yup -- We previously hardcoded in 375 GB for one SSD, but now we're gonna explicitly handle >1 SSD count in N2s via the new gcp_local_ssd_count().

Contributor Author

And "375 GB" is now a constant (GCP_LOCAL_SSD_PARTITION_SIZE) rather than a hardcoded value https://github.com/grohli/hail/blob/d646fb3f663bfcfcdd0b2bb0292ff4fb20a32b4a/batch/batch/cloud/gcp/resource_utils.py#L346

Comment thread batch/test/test_utils.py Outdated
assert int(machine_parts.memory / machine_parts.cores / 1024**2) == 8192


def test_gcp_local_ssd_count():
Collaborator

Not a huge deal since they're so short anyway, but if you use pytest parameterized test here you'd get a pass/failure per case, and each case would run even if the preceding one failed. Also true for test_gcp_local_ssd_size below.
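
A sketch of what that suggestion could look like with `pytest.mark.parametrize`. The `ssd_count` function is a hypothetical stand-in for the function under test, and the expected values are the N2 minimums quoted elsewhere in this PR.

```python
import pytest


def ssd_count(cores: int) -> int:
    # Hypothetical stand-in for the function under test; values are the
    # N2 minimums quoted in this PR.
    return {16: 2, 32: 4, 48: 8, 64: 8, 96: 16}[cores]


@pytest.mark.parametrize(
    'cores, expected',
    [(16, 2), (32, 4), (48, 8), (64, 8), (96, 16)],
)
def test_ssd_count(cores: int, expected: int) -> None:
    # Each tuple is collected and reported as its own test case.
    assert ssd_count(cores) == expected
```

Because each tuple becomes a separate test item, a failure for one core count does not stop the remaining cases from running.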

Comment thread batch/test/utils.py Outdated

 if cloud == 'gcp':
-    return 'n1-standard-1'
+    return 'n2-standard-2'
Collaborator

I think this is for custom machine tests? In which case I think n1-standard-1 is still valid and still smaller than n2-standard-2?

Comment on lines +264 to +270
if [ "$NUM_LOCAL_SSDS" -gt 1 ]; then
    DEVICES=""
    for i in $(seq 1 $NUM_LOCAL_SSDS); do
        DEVICES="$DEVICES /dev/nvme0n$i"
    done
    mdadm --create /dev/md0 --level=0 --raid-devices=$NUM_LOCAL_SSDS $DEVICES --force --run
fi

The RAID script constructs device paths incorrectly for N2 machines. The loop seq 1 $NUM_LOCAL_SSDS generates indices 1, 2, 3... which creates device names /dev/nvme0n1, /dev/nvme0n2, /dev/nvme0n3. However, when multiple local SSDs are attached to a GCP instance, they appear as /dev/nvme0n1, /dev/nvme0n2, etc. BUT the script is missing sudo for the mdadm command, and more critically, NVMe device indices in GCP start from a specific offset depending on the boot disk. The script should verify device existence before use.

if [ "$NUM_LOCAL_SSDS" -gt 1 ]; then
    DEVICES=""
    for i in $(seq 0 $(($NUM_LOCAL_SSDS - 1))); do
        DEVICE="/dev/nvme0n$((i+1))"
        # Wait for device to be ready
        while [ ! -e "$DEVICE" ]; do sleep 1; done
        DEVICES="$DEVICES $DEVICE"
    done
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=$NUM_LOCAL_SSDS $DEVICES --force --run
fi

The missing sudo will cause the mdadm command to fail with permission denied.
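
Both versions of the loop above yield the same device list, numbered from 1. A small Python sketch of that path construction, which could be used to check the generated command in a unit test. The helper names are hypothetical, and the `/dev/nvme0n<i>` naming is taken from the script in this PR rather than verified against GCP.

```python
from typing import List, Optional


def local_ssd_devices(num_local_ssds: int) -> List[str]:
    # Mirrors `seq 1 $NUM_LOCAL_SSDS`: device suffixes are numbered from 1.
    return [f'/dev/nvme0n{i}' for i in range(1, num_local_ssds + 1)]


def raid0_command(num_local_ssds: int) -> Optional[List[str]]:
    """Return the mdadm argv for a RAID0 array, or None for a single SSD."""
    if num_local_ssds <= 1:
        return None
    devices = local_ssd_devices(num_local_ssds)
    return [
        'mdadm', '--create', '/dev/md0', '--level=0',
        f'--raid-devices={num_local_ssds}', *devices, '--force', '--run',
    ]
```

Returning `None` for a single SSD matches the script's `-gt 1` guard, where no array is created and the lone device is used directly.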

Spotted by Graphite


grohli and others added 7 commits June 1, 2026 11:32
Switch GCP_MACHINE_FAMILY from 'n1' to 'n2' so all pool workers
create N2 VMs. Add N2 memory ratios, machine type entries
(standard/highmem/highcpu), and valid core counts. Update pricing
pipeline to recognize N2 SKUs and handle N2 memory SKU descriptions.
Existing N1 entries retained for GPU JPIM billing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…partitions and RAID0 when >1

N2 machines require local SSDs in specific quantities (e.g., n2-standard-16
requires minimum 2). The previous code always attached exactly 1, which GCP
rejects. This adds a per-core-count lookup for the minimum SSD count, attaches
that many SCRATCH disks in the VM config, and combines them via mdadm RAID0 in
the startup script when count > 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Corrected the N2_MIN_LOCAL_SSD_COUNT_BY_CORES lookup table using the
actual values from GCP documentation. Four entries were underestimated:
32-core (2->4), 48-core (4->8), 64-core (4->8), 96-core (8->16).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge duplicate imports from batch.cloud.gcp.resource_utils into a single
statement to satisfy ruff's isort rules after merge with upstream main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Split aliased import from regular imports per ruff 0.11.13's isort rules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix create_instance.py merge conflict: remove duplicated disk/docker
  block, keep containerd version for ubuntu 24, move ops agent setup
  before SSD wrangling, position RAID0 assembly correctly
- Revert smallest_machine_type to n1-standard-1 for custom machine tests
- Parameterize test_gcp_local_ssd_count and test_gcp_local_ssd_size
- Add 128 to highmem valid cores to match n2-highmem-128 in MACHINE_TYPE_TO_PARTS
- Combine duplicate imports in test_utils.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
grohli force-pushed the upgrade-hail-pooled-vms-from-n1-to-n2-machines branch from 23a8ebb to 8a8771f June 1, 2026 15:38
3 participants