[batch] Upgrade pooled vms from n1 to n2 machines #15497
base: main
Changes from all commits
2abf1ee
e0e96a0
8c2a5f0
fefae77
64867f7
54f2dcb
8a8771f
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,12 +6,15 @@ | |
|
|
||
| GCP_MAX_PERSISTENT_SSD_SIZE_GIB = 64 * 1024 | ||
|
|
||
| GCP_MACHINE_FAMILY = 'n1' | ||
| GCP_MACHINE_FAMILY = 'n2' | ||
|
|
||
| MEMORY_PER_CORE_MIB = { | ||
| ('n1', 'standard'): 3840, | ||
| ('n1', 'highmem'): 6656, | ||
| ('n1', 'highcpu'): 924, | ||
| ('n2', 'standard'): 4096, | ||
| ('n2', 'highmem'): 8192, | ||
| ('n2', 'highcpu'): 1024, | ||
| } | ||
|
|
||
|
|
||
|
|
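A minimal sketch of how a per-core table like `MEMORY_PER_CORE_MIB` can be used, assuming total memory is simply the per-core figure times the core count. The helper name `memory_mib_for` is hypothetical and not part of this PR; the table values are copied from the diff. Irregular shapes that need an explicit entry are not covered by this rule.

```python
# Per-core memory in MiB, copied from the diff above.
MEMORY_PER_CORE_MIB = {
    ('n1', 'standard'): 3840,
    ('n1', 'highmem'): 6656,
    ('n1', 'highcpu'): 924,
    ('n2', 'standard'): 4096,
    ('n2', 'highmem'): 8192,
    ('n2', 'highcpu'): 1024,
}


def memory_mib_for(machine_family: str, worker_type: str, cores: int) -> int:
    """Hypothetical helper: total memory in MiB for a regular machine shape."""
    return MEMORY_PER_CORE_MIB[(machine_family, worker_type)] * cores


# Compare the same worker type and core count across the two families.
for family in ('n1', 'n2'):
    print(family, 'standard, 8 cores:', memory_mib_for(family, 'standard', 8), 'MiB')
```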
@@ -116,13 +119,62 @@ def __init__(self, machine_family: str, worker_type: str, cores: int, memory: in | |
| for cores in [2, 4, 8, 16, 32, 64, 96] | ||
| } | ||
|
|
||
| # N2 Standard cores: 2 4 8 16 32 48 64 80 96 128 | ||
| # N2 Standard mem: 4 * cores GiB | ||
| n2_standard_machines = { | ||
| f'n2-standard-{cores}': MachineTypeParts( | ||
| cores=cores, | ||
| memory=gib_to_bytes(4 * cores), | ||
| gpu_config=None, | ||
| machine_family='n2', | ||
| worker_type='standard', | ||
| ) | ||
| for cores in [2, 4, 8, 16, 32, 48, 64, 80, 96, 128] | ||
| } | ||
|
|
||
| # N2 Highmem cores: 2 4 8 16 32 48 64 80 96 | ||
| # N2 Highmem mem: 8 * cores GiB | ||
| n2_highmem_machines = { | ||
| f'n2-highmem-{cores}': MachineTypeParts( | ||
| cores=cores, | ||
| memory=gib_to_bytes(8 * cores), | ||
| gpu_config=None, | ||
| machine_family='n2', | ||
| worker_type='highmem', | ||
| ) | ||
| for cores in [2, 4, 8, 16, 32, 48, 64, 80, 96] | ||
| } | ||
|
|
||
| # N2 Highcpu cores: 2 4 8 16 32 48 64 80 96 | ||
| # N2 Highcpu mem: 1024 * cores MiB | ||
| n2_highcpu_machines = { | ||
| f'n2-highcpu-{cores}': MachineTypeParts( | ||
| cores=cores, | ||
| memory=mib_to_bytes(1024 * cores), | ||
| gpu_config=None, | ||
| machine_family='n2', | ||
| worker_type='highcpu', | ||
| ) | ||
| for cores in [2, 4, 8, 16, 32, 48, 64, 80, 96] | ||
| } | ||
|
|
||
| MACHINE_TYPE_TO_PARTS = { | ||
| **n1_standard_t4_machines, | ||
| **n1_highmem_t4_machines, | ||
| **n1_highcpu_t4_machines, | ||
| **n1_standard_machines, | ||
| **n1_highmem_machines, | ||
| **n1_highcpu_machines, | ||
| **n2_standard_machines, | ||
| **n2_highmem_machines, | ||
| **n2_highcpu_machines, | ||
| 'n2-highmem-128': MachineTypeParts( | ||
| cores=128, | ||
| memory=gib_to_bytes(864), | ||
|
Collaborator
Not 1024?
Contributor
Author
Correct. I'm not sure why it ends up as 864 or why it breaks the 8 * cores pattern, but that's what the documentation lists: https://docs.cloud.google.com/compute/docs/general-purpose-machines#n2-high-mem |
||
| gpu_config=None, | ||
| machine_family='n2', | ||
| worker_type='highmem', | ||
| ), | ||
|
Collaborator
No
Contributor
Author
|
||
| 'g2-standard-4': MachineTypeParts( | ||
| cores=4, | ||
| memory=gib_to_bytes(16), | ||
|
|
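The pattern discussed in the thread above (regular shapes generated by a rule, one irregular shape listed explicitly) can be sketched as follows. This is an illustration only: `gib_to_bytes` is given an assumed definition because its body is not shown in the diff, and `n2_highmem_memory` is a hypothetical, simplified stand-in for the `MachineTypeParts` entries.

```python
def gib_to_bytes(gib: int) -> int:
    # Assumed definition; the PR calls a helper of this name but does not show it.
    return gib * 1024**3


# Regular highmem shapes follow the 8 GiB-per-core rule from the diff.
regular_highmem_cores = [2, 4, 8, 16, 32, 48, 64, 80, 96]
n2_highmem_memory = {
    f'n2-highmem-{cores}': gib_to_bytes(8 * cores) for cores in regular_highmem_cores
}

# The 128-core shape is added by hand because it does not follow the rule.
n2_highmem_memory['n2-highmem-128'] = gib_to_bytes(864)

for name, memory in n2_highmem_memory.items():
    cores = int(name.rsplit('-', 1)[1])
    gib_per_core = memory / gib_to_bytes(1) / cores
    print(f'{name}: {gib_per_core} GiB per core')
```

Listing the exception explicitly keeps the comprehension simple and makes the irregular value easy to find when the provider's documentation changes.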
@@ -245,9 +297,9 @@ def __init__(self, machine_family: str, worker_type: str, cores: int, memory: in | |
| } | ||
|
|
||
| gcp_valid_cores_for_pool_worker_type = { | ||
| 'highcpu': [2, 4, 8, 16, 32, 64, 96], | ||
| 'standard': [1, 2, 4, 8, 16, 32, 64, 96], | ||
| 'highmem': [2, 4, 8, 16, 32, 64, 96], | ||
| 'standard': [2, 4, 8, 16, 32, 48, 64, 80, 96, 128], | ||
| 'highmem': [2, 4, 8, 16, 32, 48, 64, 80, 96, 128], | ||
| 'highcpu': [2, 4, 8, 16, 32, 48, 64, 80, 96], | ||
| } | ||
|
|
||
| gcp_valid_machine_types = list(MACHINE_TYPE_TO_PARTS.keys()) | ||
|
|
@@ -291,8 +343,38 @@ def gcp_is_valid_storage_request(storage_in_gib: int) -> bool: | |
| return 10 <= storage_in_gib <= GCP_MAX_PERSISTENT_SSD_SIZE_GIB | ||
|
|
||
|
|
||
| def gcp_local_ssd_size() -> int: | ||
| return 375 | ||
| GCP_LOCAL_SSD_PARTITION_SIZE_GIB = 375 | ||
|
|
||
| # N2 machines require local SSDs in specific quantities that vary by vCPU count. | ||
| # Source: https://docs.cloud.google.com/compute/docs/general-purpose-machines#n2-standard | ||
| N2_MIN_LOCAL_SSD_COUNT_BY_CORES = { | ||
| 2: 1, | ||
| 4: 1, | ||
| 8: 1, | ||
| 16: 2, | ||
| 32: 4, | ||
| 48: 8, | ||
| 64: 8, | ||
| 80: 8, | ||
| 96: 16, | ||
| 128: 16, | ||
| } | ||
|
|
||
|
|
||
| def gcp_local_ssd_count(machine_family: str, cores: int) -> int: | ||
| if machine_family != 'n2': | ||
| return 1 | ||
| count = N2_MIN_LOCAL_SSD_COUNT_BY_CORES.get(cores) | ||
| if count is not None: | ||
| return count | ||
| for threshold_cores in sorted(N2_MIN_LOCAL_SSD_COUNT_BY_CORES.keys(), reverse=True): | ||
| if cores >= threshold_cores: | ||
| return N2_MIN_LOCAL_SSD_COUNT_BY_CORES[threshold_cores] | ||
| return 1 | ||
|
|
||
|
|
||
| def gcp_local_ssd_size(machine_family: str, cores: int) -> int: | ||
|
Collaborator
This was updated to work for n2. Is it still valid when people ask for custom machine types that are not n2 (g4, n1, e5, etc.)?
Contributor
Author
Yup. We previously hardcoded 375 GB for a single SSD, but now we explicitly handle SSD counts greater than one on N2s via the new gcp_local_ssd_count().
Contributor
Author
And "375 GB" is now a constant ( |
||
| return GCP_LOCAL_SSD_PARTITION_SIZE_GIB * gcp_local_ssd_count(machine_family, cores) | ||
|
|
||
|
|
||
| def machine_type_to_gpu(machine_type: str) -> Optional[str]: | ||
|
|
||
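To make the behaviour discussed in this thread concrete, here is a small worked example. The constant, table, and two functions are copied from the diff; only the sample calls at the bottom are added for illustration.

```python
GCP_LOCAL_SSD_PARTITION_SIZE_GIB = 375

N2_MIN_LOCAL_SSD_COUNT_BY_CORES = {
    2: 1, 4: 1, 8: 1, 16: 2, 32: 4, 48: 8, 64: 8, 80: 8, 96: 16, 128: 16,
}


def gcp_local_ssd_count(machine_family: str, cores: int) -> int:
    # Non-N2 families keep the previous behaviour: a single local SSD.
    if machine_family != 'n2':
        return 1
    count = N2_MIN_LOCAL_SSD_COUNT_BY_CORES.get(cores)
    if count is not None:
        return count
    # Core counts not in the table fall back to the nearest lower threshold.
    for threshold_cores in sorted(N2_MIN_LOCAL_SSD_COUNT_BY_CORES.keys(), reverse=True):
        if cores >= threshold_cores:
            return N2_MIN_LOCAL_SSD_COUNT_BY_CORES[threshold_cores]
    return 1


def gcp_local_ssd_size(machine_family: str, cores: int) -> int:
    return GCP_LOCAL_SSD_PARTITION_SIZE_GIB * gcp_local_ssd_count(machine_family, cores)


# Illustrative calls (not part of the PR).
for family, cores in [('n1', 16), ('n2', 16), ('n2', 24), ('n2', 128)]:
    print(family, cores, gcp_local_ssd_size(family, cores), 'GiB')
```

For any family other than `n2` the result is still a single 375 GiB partition, which matches the author's answer above.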
The RAID script constructs device paths incorrectly for N2 machines. The loop `seq 1 $NUM_LOCAL_SSDS` generates indices 1, 2, 3..., which creates the device names `/dev/nvme0n1`, `/dev/nvme0n2`, `/dev/nvme0n3`. However, when multiple local SSDs are attached to a GCP instance, they appear as `/dev/nvme0n1`, `/dev/nvme0n2`, etc. But the script is missing `sudo` for the `mdadm` command, and more critically, NVMe device indices in GCP start from a specific offset depending on the boot disk. The script should verify device existence before use.

The missing `sudo` will cause the `mdadm` command to fail with permission denied.

Spotted by Graphite
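The "verify device existence before use" suggestion could look roughly like the sketch below. It is written in Python rather than in the worker's shell script, and both function names are hypothetical. The `/dev/nvme0n<i>` naming with 1-based indices is taken from the comment above and is an assumption, not something verified against a running instance.

```python
import os
from typing import Callable, List


def expected_local_ssd_paths(num_local_ssds: int) -> List[str]:
    """Device paths under the assumed /dev/nvme0n<i> naming, with i starting at 1."""
    return [f'/dev/nvme0n{i}' for i in range(1, num_local_ssds + 1)]


def missing_devices(
    paths: List[str], exists: Callable[[str], bool] = os.path.exists
) -> List[str]:
    """Return the paths that are not present, so the caller can fail early."""
    return [path for path in paths if not exists(path)]


# Example with a fake existence check so the result does not depend on the host.
paths = expected_local_ssd_paths(2)
missing = missing_devices(paths, exists=lambda path: path == '/dev/nvme0n1')
if missing:
    print('refusing to build the RAID array; missing devices:', missing)
```

Failing before `mdadm` is invoked turns a confusing array-creation error into a clear message about which device was not found.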

Is this helpful? React 👍 or 👎 to let us know.