Fix sample-cache corruption under accelerate data-parallel by shipbehaves · Pull Request #1271 · huggingface/lighteval

shipbehaves · 2026-06-22T14:58:23Z

What

Under an accelerate data-parallel launch, SampleCache.get_cache_path is not rank-aware, so every rank's @cached wrapper calls cache_samples and writes the same parquet file concurrently. The interleaved writes corrupt the file, and the subsequent get_samples_from_cache / load_dataset call then fails to load it.

Fix

In the cached wrapper (src/lighteval/utils/cache_management.py), only the main process writes the cache, with a wait_for_everyone() barrier so the other ranks wait for that write before they read:

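# Backends without an accelerator have no such attribute, so fall back to None.
accelerator = getattr(self, "accelerator", None)
# Only the main process (or a backend without an accelerator) writes the cache.
if accelerator is None or accelerator.is_main_process:
    cache.cache_samples(...)
# Every rank waits here, so no rank reads before the write completes.
if accelerator is not None:
    accelerator.wait_for_everyone()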

This is safe and loses no data: by the time results reach the wrapper they have already been gathered across ranks (e.g. by pad_and_gather in the transformers backend), so the main process holds the full set. This matches @NathanHB's suggestion on the issue. Backends without an accelerator (getattr returns None) are unaffected, and on a single process the guard is a no-op.
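For readers who want to try the guard in isolation, here is a minimal sketch of the same pattern as a standalone decorator. It is not the code in cache_management.py: write_on_main_only, FakeAccelerator and Model are hypothetical names invented for this illustration, and FakeAccelerator only mimics the two members the guard relies on (is_main_process and wait_for_everyone()).

```python
from functools import wraps


def write_on_main_only(write_fn):
    """Hypothetical decorator: run write_fn on the main process only, then sync."""

    @wraps(write_fn)
    def wrapper(self, *args, **kwargs):
        # Objects without an accelerator attribute are treated as single-process.
        accelerator = getattr(self, "accelerator", None)
        wrote = False
        if accelerator is None or accelerator.is_main_process:
            write_fn(self, *args, **kwargs)
            wrote = True
        if accelerator is not None:
            # Every rank reaches the barrier, so none reads a half-written file.
            accelerator.wait_for_everyone()
        return wrote

    return wrapper


class FakeAccelerator:
    """Stand-in for an accelerator object (assumption); counts barrier calls."""

    def __init__(self, is_main_process):
        self.is_main_process = is_main_process
        self.barrier_calls = 0

    def wait_for_everyone(self):
        self.barrier_calls += 1


class Model:
    """Hypothetical backend whose save() stands in for the cache write."""

    def __init__(self, accelerator=None):
        self.accelerator = accelerator
        self.written = []

    @write_on_main_only
    def save(self, samples):
        self.written.append(list(samples))
```

In this sketch, calling save() on a Model built with FakeAccelerator(False) records nothing but still hits the barrier once, while a Model with no accelerator writes unconditionally and never touches a barrier.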

Test

Adds test_cache_only_main_process_writes in tests/unit/utils/test_caching.py: with a mocked non-main process the cache file is not written and the barrier is hit; with the main process it is written. It fails on main and passes with this change.
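The shape of such a test can be sketched as follows. This is an illustration under assumptions, not the test added in tests/unit/utils/test_caching.py: guarded_write and run_case are hypothetical helpers, the file name is arbitrary, and a plain byte write stands in for the real parquet write.

```python
import tempfile
from pathlib import Path
from unittest.mock import MagicMock


def guarded_write(accelerator, path: Path, payload: bytes) -> None:
    """Hypothetical stand-in for the guarded cache write in the wrapper."""
    if accelerator is None or accelerator.is_main_process:
        path.write_bytes(payload)
    if accelerator is not None:
        accelerator.wait_for_everyone()


def run_case(is_main: bool):
    """Return (file_exists, barrier_call_count) for one mocked rank."""
    accelerator = MagicMock()
    accelerator.is_main_process = is_main
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "samples.parquet"
        guarded_write(accelerator, path, b"rows")
        return path.exists(), accelerator.wait_for_everyone.call_count


def test_only_main_process_writes():
    # Non-main rank: nothing on disk, but the barrier is still reached.
    assert run_case(is_main=False) == (False, 1)
    # Main rank: the file is written and the barrier is reached.
    assert run_case(is_main=True) == (True, 1)
```

Mocking the accelerator keeps the test on a single process, so it checks the branch logic rather than real multi-rank behaviour.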

AI-assisted: drafted and tested with AI help, written and reviewed by a human who understands and stands behind the change.

Under an accelerate data-parallel launch, every rank holds the full gathered results and writes the same parquet cache file concurrently, corrupting it and making subsequent loads fail. Write the cache only on the main process and add a barrier so the other ranks wait for that write before reading. Add a regression test. Fixes huggingface#1102

bot-ci-comment · 2026-06-23T10:08:45Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

NathanHB · 2026-06-23T12:20:15Z

Hey! There seems to be an issue with the tests, will merge when fixed :)

NathanHB approved these changes Jun 23, 2026

shipbehaves wants to merge 1 commit into huggingface:main from shipbehaves:fix/cache-only-main-process-writes
