[batch] Use gcloud to localize large input files from GCS by kush-chandra · Pull Request #15350 · hail-is/hail

kush-chandra · 2026-03-19T15:00:22Z

Change Description

This is a redo of #15273 which addresses the scalability issues with that approach:

We now only use the chunked gcloud downloader if the file is large enough to benefit. Files below the chunk size will be downloaded in a single process, as before.
Both the gcloud download paths are now bounded by thread count & memory buffer usage, preventing memory errors on lowmem VMs.

The timing improvements for large files remain the same as in the original change, and there is now no spike in latency when downloading hundreds of smaller files.

Security Assessment

This change potentially impacts the Hail Batch instance as deployed by Broad Institute in GCP

Impact Rating

This change has a medium security impact

Impact Description

This change instantiates a new Google client to download files from GCS buckets. It uses the same local credential files as the existing custom clients we've created. I manually verified the problem cases with the original change now work as intended and am working on adding additional testing for those paths.

Appsec Review

Required: The impact has been assessed and approved by appsec

cjllanwarne · 2026-03-19T19:22:30Z

+                timeout=self._timeout,
+            )
+        except FileNotFoundError:
+            os.makedirs(os.path.dirname(local_dest), exist_ok=True)


Can we check ahead of time that the directory exists (instead of letting it throw an exception that we then have to catch do resubmit the exact same command in the except block?)

Yep, I've refactored this. The original copier code used that pattern to rely on an implementation detail of local_fs to catch an edge case where we're trying to expand a directory into an existing file which is not a directory, so I made that check explicit here.

cjllanwarne · 2026-03-19T19:23:08Z

+                max_workers=8,
+            )
+        except FileNotFoundError:
+            os.makedirs(os.path.dirname(local_dest), exist_ok=True)


Same thought here

cjllanwarne · 2026-03-19T19:36:49Z

+            success = False
+            threads_acquired = 0
+            try:
+                for _ in range(0, 8):


There's two risky things here:

We already acquired a sema slot from bounded_gather2. Especially with the xfer_sema waits introducing an easy place for tasks to be paused/suspended, it's very possible that lots of tasks get a bounded_gather2 sema slots and so the sema pool is empty before we try to acquire another 8 slots for the download here

Because this acquiring logic is not atomic, it's possible that a thread might pick up a sub-portion of its 8 required slots here, then get suspended, and those reserved slots are locked up in this task which isn't doing anything with them yet. It looks like the weighted semaphor semantics is maybe more what you're looking for here (you can atomically request a number of slots)?

I changed the copier semaphore to a weighted semaphore so this can atomically request 8 threads from the shared pool. xfer_sema appears to allow more simultaneous transfers than sema does, so I think this mitigates the deadlock risk. (The original copier has the same thresholds for sema & xfer_sema, so I'm wondering how much value xfer_sema is actually providing like this.)

cjllanwarne · 2026-03-19T20:09:38Z

        )
+        credentials = kwargs.get('credentials')
+        if isinstance(credentials, GoogleCredentials):
+            access_token = credentials.access_token


I believe these access tokens have a 1 hour lifespan before they'll be rejected. We can split on the type of google credentials we have and use one of Credentials.from_service_account_info(credentials.key) for service accounts or Credentials(token=None, refresh_token=credentials.credentials['refresh_token'], client_id=credentials.credentials['client_id'], client_secret=credentials.credentials['client_secret'], token_uri='https://oauth2.googleapis.com/token') for application-default credentials, or Client(credentials = None) for automatic application-default credentials.

In fact, credentials.access_token looks like a method (and an async method at that), so this would return a method to produce a coroutine rather than a value, so I'm not sure this code path would work even temporarily. I think we must be going the credentials=None / application-default route below during the on-VM testing, but I'm not sure it'd work if we tried to pass credentials in on a local laptop

Thanks for catching that. I split it up into service account/token creds/anonymous creds/application default creds. From what I could tell, VM testing is indeed using the default path.

graphite-app · 2026-04-06T04:40:41Z

+        if not os.path.exists(os.path.dirname(dest)):
+            os.makedirs(os.path.dirname(dest), exist_ok=True)


Blocking I/O operations (os.path.exists and os.makedirs) are called in an async method without being wrapped in blocking_to_async. This blocks the event loop.

if not os.path.exists(os.path.dirname(dest)): os.makedirs(os.path.dirname(dest), exist_ok=True) # Should be: await blocking_to_async(self._thread_pool, os.makedirs, os.path.dirname(dest), exist_ok=True)

Also, the os.path.exists check is redundant since exist_ok=True handles existing directories.

Suggested change

if not os.path.exists(os.path.dirname(dest)):

os.makedirs(os.path.dirname(dest), exist_ok=True)

await blocking_to_async(self._thread_pool, os.makedirs, os.path.dirname(dest), exist_ok=True)

Spotted by Graphite

Is this helpful? React 👍 or 👎 to let us know.

cjllanwarne · 2026-04-15T15:05:20Z

If #15395 merges, the test batch for this PR would only have run the following steps:

[
    'check_hail',
    'check_pip_requirements',
    'check_services',
    'merge_code',
    'test_auth',
    'test_auth_copy_paste_login',
    'test_auth_copy_paste_login_timeout',
    'test_batch',
    'test_batch_docs',
    'test_batch_invariants',
    'test_batch_job_private_machines',
    'test_ci',
    'test_ci_unit',
    'test_hail_python',
    'test_hail_python_local_backend',
    'test_hail_python_service_backend_gcp',
    'test_hail_python_unchecked_allocator',
    'test_hailctl_batch',
    'test_hailtop_python',
    'test_hailtop_python_fs',
    'test_monitoring',
]

…15273) ## Change Description Fixes hail-is#15011 Google vends a tool for parallelized file downloads in their Python client library: https://docs.cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.transfer_manager.html Using this instead of our copier code to download from GCS buckets to local storage lets us download larger files which our copier code cannot handle without frequent timeouts & failures. I compared the performance when a job downloads an input file using the copier vs using the Google library: | File size | Copier code | transfer_manager | |--------|--------|--------| | 500B | 1s | 0.8s | | 5M | 0.6s | 0.8s | | 1G | 50s | 2.7s | | 5G | 100s | 13.8s | | 10G | 219s | 27s | | 21G | failed | 54s | | 54G | failed | 135s | Performance is similar for smaller files with variance mostly due to network bandwidth. At 1G, the copier starts to hit transient errors on some chunks & retries. Beyond 20G, jobs using the copier failed in the input stage after an indeterminate length of time. ## Security Assessment Delete all except the correct answer: - This change potentially impacts the Hail Batch instance as deployed by Broad Institute in GCP ### Impact Rating Delete all except the correct answer: - This change has a medium security impact ### Impact Description Replace this content with a description of the impact of the change: This change instantiates a new Google client to download files from GCS buckets. It uses the same local credential files as the existing custom clients we've created. ### Appsec Review - [x] Required: The impact has been assessed and approved by appsec --------- Co-authored-by: Chris Llanwarne <cjllanwarne@users.noreply.github.com> # Conflicts: # gear/pinned-requirements.txt # hail/python/hailtop/pinned-requirements.txt # hail/python/pinned-requirements.txt # Conflicts: # gear/pinned-requirements.txt # hail/python/hailtop/pinned-requirements.txt # hail/python/pinned-requirements.txt

cjllanwarne

I have a few thoughts -

I don't fully understand what's happening here. Would you be willing to give a walkthrough?
I'm a little scared given how many big - and pretty foundational - changes are happening to such an intricate (brittle!) system
One of my original big hopes was to reduce the maintenance overhead of our in-house copier having to re-implement gcloud storage cp in python code. But we are doubling down on that in this PR

Before we commit to this route... I'd really like to see what a "shell out to gcloud storage cp" version would look like. Even if we have to make a couple of calls because it doesn't support many <=> many directory transfers, that still feels like it would be better than continuing to follow this "re-implement it ourselves" codebase. What do you think?

kush-chandra requested a review from a team as a code owner March 19, 2026 15:00

graphite-app Bot reviewed Mar 19, 2026

View reviewed changes

Comment thread hail/python/hailtop/aiocloud/aiogoogle/client/storage_client.py Outdated

Comment thread hail/python/hailtop/aiocloud/aiogoogle/client/storage_client.py Outdated

kush-chandra requested a review from cjllanwarne March 19, 2026 15:08

kush-chandra force-pushed the kchandra-copier-memory-fix branch from fbc83d5 to c514da6 Compare March 19, 2026 15:45

cjllanwarne requested changes Mar 19, 2026

View reviewed changes

graphite-app Bot reviewed Mar 31, 2026

View reviewed changes

Comment thread hail/python/hailtop/aiotools/fs/copier.py Outdated

Comment thread hail/python/hailtop/aiotools/weighted_semaphore.py Outdated

kush-chandra force-pushed the kchandra-copier-memory-fix branch 3 times, most recently from 8a18981 to c1b9bbf Compare April 1, 2026 14:22

graphite-app Bot reviewed Apr 1, 2026

View reviewed changes

Comment thread hail/python/hailtop/aiotools/copy.py Outdated

kush-chandra force-pushed the kchandra-copier-memory-fix branch 7 times, most recently from 1cc7e2d to 289df0d Compare April 6, 2026 03:52

graphite-app Bot reviewed Apr 6, 2026

View reviewed changes

Comment thread hail/python/hailtop/aiocloud/aiogoogle/client/storage_client.py Outdated

kush-chandra force-pushed the kchandra-copier-memory-fix branch from 289df0d to eb7b283 Compare April 6, 2026 04:35

graphite-app Bot reviewed Apr 6, 2026

View reviewed changes

kush-chandra force-pushed the kchandra-copier-memory-fix branch 3 times, most recently from ead1b66 to 442912c Compare April 7, 2026 02:16

graphite-app Bot reviewed Apr 7, 2026

View reviewed changes

Comment thread hail/python/hailtop/aiocloud/aiogoogle/client/storage_client.py Outdated

Comment thread hail/python/hailtop/aiotools/fs/copier.py Outdated

kush-chandra force-pushed the kchandra-copier-memory-fix branch 2 times, most recently from d5ee35f to b9549ca Compare April 9, 2026 14:44

graphite-app Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread hail/python/hailtop/aiotools/copy.py Outdated

kush-chandra force-pushed the kchandra-copier-memory-fix branch 2 times, most recently from a48fe0e to 74a0e0a Compare April 14, 2026 18:37

graphite-app Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread hail/python/hailtop/aiocloud/aiogoogle/client/storage_client.py Outdated

kush-chandra force-pushed the kchandra-copier-memory-fix branch from 74a0e0a to 699d16c Compare April 14, 2026 18:47

kush-chandra force-pushed the kchandra-copier-memory-fix branch from 699d16c to f7db91a Compare April 21, 2026 14:28

kush-chandra added 9 commits April 21, 2026 11:20

Handle normal/large GCP -> local transfers separately

17cab36

Bound copy processes with WeightedSemaphore

3bc3b07

Update copy tests to use WeightedSemaphore

ef1eab7

Add check for local file dest

3b1b7d0

Fix type assertions for old path

f3787ce

Fix directory expansion bug

18e9995

Defer to GCS copier from within copier flow

0550eb3

Defer to GCS copier from within copier flow

de1008d

kush-chandra force-pushed the kchandra-copier-memory-fix branch from f7db91a to 5c61612 Compare April 21, 2026 15:29

Fix empty dest path copy

fae031c

kush-chandra force-pushed the kchandra-copier-memory-fix branch from 5c61612 to fae031c Compare April 21, 2026 15:39

kush-chandra requested a review from cjllanwarne April 21, 2026 18:34

cjllanwarne assigned kush-chandra Apr 23, 2026

cjllanwarne requested changes Apr 24, 2026

View reviewed changes

		if not os.path.exists(os.path.dirname(dest)):
		os.makedirs(os.path.dirname(dest), exist_ok=True)

	if not os.path.exists(os.path.dirname(dest)):
	os.makedirs(os.path.dirname(dest), exist_ok=True)
	await blocking_to_async(self._thread_pool, os.makedirs, os.path.dirname(dest), exist_ok=True)

Conversation

kush-chandra commented Mar 19, 2026

Change Description

Security Assessment

Impact Rating

Impact Description

Appsec Review

Uh oh!

Uh oh!

Uh oh!

cjllanwarne Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

kush-chandra Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

cjllanwarne Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

cjllanwarne Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kush-chandra Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

cjllanwarne Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kush-chandra Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

graphite-app Bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cjllanwarne commented Apr 15, 2026

Uh oh!

cjllanwarne left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cjllanwarne Mar 19, 2026 •

edited

Loading

cjllanwarne Mar 19, 2026 •

edited

Loading