Skip to content

Fix sample-cache corruption under accelerate data-parallel#1271

Open
shipbehaves wants to merge 1 commit into
huggingface:mainfrom
shipbehaves:fix/cache-only-main-process-writes
Open

Fix sample-cache corruption under accelerate data-parallel#1271
shipbehaves wants to merge 1 commit into
huggingface:mainfrom
shipbehaves:fix/cache-only-main-process-writes

Conversation

@shipbehaves

Copy link
Copy Markdown

What

Fixes #1102.

Under an accelerate data-parallel launch, SampleCache.get_cache_path is not rank-aware, so every rank's @cached wrapper calls cache_samples and writes the same parquet file concurrently. The interleaved writes corrupt the file, and the subsequent get_samples_from_cache / load_dataset call then fails to load it.

Fix

In the cached wrapper (src/lighteval/utils/cache_management.py), only the main process writes the cache, with a wait_for_everyone() barrier so the other ranks wait for that write before they read:

accelerator = getattr(self, "accelerator", None)
if accelerator is None or accelerator.is_main_process:
    cache.cache_samples(...)
if accelerator is not None:
    accelerator.wait_for_everyone()

This is safe with no data loss: by the time results reach the wrapper they are already gathered across ranks (e.g. pad_and_gather in the transformers backend), so the main process holds the full set. This matches @NathanHB's suggestion on the issue. Backends without an accelerator (getattr returns None) are unaffected, and on a single process the guard is a no-op.

Test

Adds test_cache_only_main_process_writes in tests/unit/utils/test_caching.py: with a mocked non-main process the cache file is not written and the barrier is hit; with the main process it is written. It fails on main and passes with this change.


AI-assisted: drafted and tested with AI help, written and reviewed by a human who understands and stands behind the change.

Under an accelerate data-parallel launch every rank holds the full gathered
results and wrote the same parquet cache file concurrently, corrupting it and
making subsequent loads fail. Write the cache only on the main process and add
a barrier so the other ranks wait for that write before reading. Add a
regression test.

Fixes huggingface#1102
@bot-ci-comment

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB

Copy link
Copy Markdown
Member

Hey! There seems to be an issue with the tests, will merge when fixed :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Accelerate backend with dp fails to load cache

2 participants