Fix sample-cache corruption under accelerate data-parallel #1271
Open
shipbehaves wants to merge 1 commit into
Under an accelerate data-parallel launch, every rank holds the full gathered results and writes the same parquet cache file concurrently, which corrupts the file and makes subsequent loads fail. Write the cache only on the main process and add a barrier so the other ranks wait for that write before reading. Add a regression test. Fixes huggingface#1102
NathanHB
approved these changes
Jun 23, 2026
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Member
Hey! There seems to be an issue with the tests, will merge when fixed :)
What

Fixes #1102.

Under an `accelerate` data-parallel launch, `SampleCache.get_cache_path` is not rank-aware, so every rank's `@cached` wrapper calls `cache_samples` and writes the same parquet file concurrently. The interleaved writes corrupt the file, and the subsequent `get_samples_from_cache` / `load_dataset` call then fails to load it.

Fix

In the `cached` wrapper (`src/lighteval/utils/cache_management.py`), only the main process writes the cache, with a `wait_for_everyone()` barrier so the other ranks wait for that write before they read.

This is safe with no data loss: by the time results reach the wrapper they are already gathered across ranks (e.g. `pad_and_gather` in the transformers backend), so the main process holds the full set. This matches @NathanHB's suggestion on the issue. Backends without an accelerator (`getattr` returns `None`) are unaffected, and on a single process the guard is a no-op.

Test

Adds `test_cache_only_main_process_writes` in `tests/unit/utils/test_caching.py`: with a mocked non-main process the cache file is not written and the barrier is hit; with the main process it is written. It fails on `main` and passes with this change.

AI-assisted: drafted and tested with AI help, written and reviewed by a human who understands and stands behind the change.
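
For readers skimming the thread, here is a minimal, dependency-free sketch of the guard described under Fix and the behaviour the regression test checks. It is not the code in this PR. `FakeAccelerator`, `FakeModel`, `write_cache_guarded`, and the `accelerator` attribute name are hypothetical names chosen for illustration, and JSON stands in for the parquet cache. The only things borrowed from `accelerate` are the `is_main_process` attribute and the `wait_for_everyone()` method, both of which are imitated by the stand-in class rather than imported.

```python
import json
import tempfile
from pathlib import Path


class FakeAccelerator:
    """Hypothetical stand-in exposing the two Accelerator members used below."""

    def __init__(self, is_main_process: bool):
        self.is_main_process = is_main_process
        self.barrier_calls = 0

    def wait_for_everyone(self) -> None:
        # The real barrier blocks until every rank arrives; here we only count calls.
        self.barrier_calls += 1


class FakeModel:
    """Hypothetical backend; `accelerator` is an assumed attribute name."""

    def __init__(self, accelerator=None):
        self.accelerator = accelerator


def write_cache_guarded(model, cache_path: Path, results: list) -> None:
    """Write already-gathered results on one rank only, then synchronize."""
    # Backends without an accelerator yield None and fall through to a plain write.
    accelerator = getattr(model, "accelerator", None)
    if accelerator is None or accelerator.is_main_process:
        # JSON keeps the sketch self-contained; the cache in the PR is parquet.
        cache_path.write_text(json.dumps(results))
    if accelerator is not None:
        # Every rank reaches the barrier, so readers wait for the single writer.
        accelerator.wait_for_everyone()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "samples.json"

    # Non-main rank: must skip the write but still hit the barrier.
    worker = FakeModel(FakeAccelerator(is_main_process=False))
    write_cache_guarded(worker, path, [1, 2, 3])
    worker_wrote = path.exists()

    # Main rank: performs the single write.
    main = FakeModel(FakeAccelerator(is_main_process=True))
    write_cache_guarded(main, path, [1, 2, 3])
    main_wrote = path.exists()
    cached_results = json.loads(path.read_text())
```

Note that the barrier sits outside the `is_main_process` branch: if only the writer called it, the other ranks would never synchronize and could read a half-written file, or the writer could hang waiting for them.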