fix(mmlu_pro): correct answer letters in prompt and choices by vineethsaivs · Pull Request #1269 · huggingface/lighteval

vineethsaivs · 2026-06-19T19:31:38Z

What does this PR do?

Fixes #1265. The MMLU-Pro task prompts models incorrectly, which makes the benchmark harder than the reference TIGER-AI-Lab harness (the issue reports a measurable score drop for non-reasoning runs).

mmlu_pro_prompt_function in src/lighteval/tasks/tasks/mmlu_pro.py had two defects:

1. The instruction hardcoded "where LETTER is one of ABCD". MMLU-Pro questions have up to 10 options (A-J), so the model was told that only A-D were valid answers.
2. Doc.choices was set to ascii_uppercase[: len(choices)]. Here choices is the already-joined prompt string, not the option list, so len(choices) counts characters. Once that count reaches 26, the slice returns the entire alphabet (A-Z, as a single str) instead of one letter per option. The intended length was len(line["options"]).
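The second defect can be reproduced outside lighteval. The following is a minimal sketch using only the standard library; the option texts are made up for illustration and are not taken from the dataset.

```python
from string import ascii_uppercase

# Hypothetical 10-option question; the option texts are placeholders.
options = [f"option {i}" for i in range(10)]

# The joined choices block that the buggy code measured instead of the option list.
joined = "\n".join(f"{letter}: {text}" for letter, text in zip(ascii_uppercase, options))

# Slicing past the end of a str does not raise; it returns the whole string.
buggy_choices = ascii_uppercase[: len(joined)]

# Slicing by the number of options gives one letter per option, as a list.
fixed_choices = list(ascii_uppercase)[: len(options)]

print(buggy_choices)  # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(fixed_choices)  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
```

Because Python clamps out-of-range slice bounds silently, the bug produces no error, only an oversized choices value.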

Fix

Build the answer letters once from len(line["options"]) and use them for both the choices block and Doc.choices, and enumerate the real letters in the instruction:

letters = list(ascii_uppercase)[: len(line["options"])]
choices = "\n".join(f"{letter}: {choice}" for letter, choice in zip(letters, line["options"]))
query = TEMPLATE.format(letters="".join(letters), question=line["question"], choices=choices)
return Doc(..., query=query, choices=letters, gold_index=line["answer_index"], instruction=query)

This mirrors the sibling gpqa task (choices=list(ascii_uppercase)[: len(choices)]) and makes Doc.choices a proper list[str]. gold_index alignment is unchanged (choices[gold_index] still yields the correct answer letter).
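For readers who want to try the fixed logic in isolation, here is a self-contained sketch. It makes two assumptions that are not from the PR: the Doc class is a simplified stand-in dataclass rather than lighteval's real Doc (which takes further arguments, elided as ... above), and the TEMPLATE wording is a guess at the shape of the real template, kept only so the {letters}, {question}, and {choices} placeholders have something to fill.

```python
from dataclasses import dataclass
from string import ascii_uppercase

# Assumed wording; the real TEMPLATE in mmlu_pro.py may differ.
TEMPLATE = (
    "Answer the following multiple choice question. The last line of your response "
    "should be of the form 'Answer: $LETTER' where LETTER is one of {letters}.\n\n"
    "{question}\n\n{choices}"
)


@dataclass
class Doc:
    """Simplified stand-in for lighteval's Doc, for illustration only."""

    query: str
    choices: list[str]
    gold_index: int
    instruction: str


def mmlu_pro_prompt_sketch(line: dict) -> Doc:
    # One letter per option, derived from the option list rather than the joined string.
    letters = list(ascii_uppercase)[: len(line["options"])]
    choices = "\n".join(f"{letter}: {choice}" for letter, choice in zip(letters, line["options"]))
    query = TEMPLATE.format(letters="".join(letters), question=line["question"], choices=choices)
    return Doc(query=query, choices=letters, gold_index=line["answer_index"], instruction=query)


# Usage with a made-up 10-option row.
row = {
    "question": "Which placeholder is third?",
    "options": [f"placeholder {i}" for i in range(1, 11)],
    "answer_index": 2,
}
doc = mmlu_pro_prompt_sketch(row)
print(doc.choices)                  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
print(doc.choices[doc.gold_index])  # C
```

Deriving letters once and reusing it for both the prompt text and Doc.choices keeps the two from drifting apart, which is the source of the original mismatch.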

Scoped to the prompt/choices defect; the issue's secondary note about fewshot-format consistency with the official harness is left for a follow-up to keep this PR focused.

How was it tested?

Added tests/unit/tasks/test_mmlu_pro.py: for a 10-option question, Doc.choices == list("ABCDEFGHIJ") (not 26 letters) and the instruction says one of ABCDEFGHIJ (not one of ABCD); letters track the option count for a 4-option question; gold_index still maps to the right letter; and each option letter appears in the prompt. The two regression assertions fail on main and pass with this change.
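The assertions described above can be expressed as a small checker over plain values. This is a sketch, not the contents of tests/unit/tasks/test_mmlu_pro.py: check_prompt_doc is a hypothetical helper, and the sample query string is hand-written rather than produced by lighteval.

```python
from string import ascii_uppercase


def check_prompt_doc(query: str, choices: list[str], gold_index: int, options: list[str]) -> None:
    """Hypothetical helper that checks the invariants the PR's tests describe."""
    expected = list(ascii_uppercase)[: len(options)]
    # Doc.choices holds exactly one letter per option.
    assert choices == expected
    # The instruction enumerates the real letters.
    assert f"one of {''.join(expected)}" in query
    # gold_index still maps to the matching letter.
    assert choices[gold_index] == ascii_uppercase[gold_index]
    # Every option letter appears in the choices block of the prompt.
    for letter in expected:
        assert f"{letter}: " in query


# Hand-written 4-option example.
options = ["1", "2", "3", "4"]
query = "... where LETTER is one of ABCD.\n\nWhat is 1 + 1?\n\nA: 1\nB: 2\nC: 3\nD: 4"
check_prompt_doc(query, list("ABCD"), 1, options)

# The pre-fix value (the full alphabet as a str) is rejected by the first assertion.
try:
    check_prompt_doc(query, ascii_uppercase, 1, options)
    rejected = False
except AssertionError:
    rejected = True
print(rejected)  # True
```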


vineethsaivs wants to merge 1 commit into huggingface:main from vineethsaivs:fix/mmlu-pro-prompt-format
