
fix(mmlu_pro): correct answer letters in prompt and choices #1269

Open
vineethsaivs wants to merge 1 commit into huggingface:main from vineethsaivs:fix/mmlu-pro-prompt-format

Conversation

@vineethsaivs

What does this PR do?

Fixes #1265. The MMLU-Pro task prompts models incorrectly, which makes the benchmark harder than the reference TIGER-AI-Lab harness (the issue reports a measurable score drop for non-reasoning runs).

mmlu_pro_prompt_function in src/lighteval/tasks/tasks/mmlu_pro.py had two defects:

  1. The instruction hardcoded "where LETTER is one of ABCD". MMLU-Pro questions have up to 10 options (A-J), so the model was told that only A-D were valid answers.
  2. The Doc was built with choices=ascii_uppercase[: len(choices)]. At that point choices is the already-joined prompt string, not the option list, so len(choices) far exceeds 26 and the slice returns the entire alphabet (a 26-character string, A-Z) instead of one letter per option. The length was meant to be len(line["options"]).
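
The second defect is easy to reproduce in isolation. The following is a minimal sketch, not the code from main: the option texts and the "A: option" line format are placeholders chosen for illustration, and only the two slicing expressions are taken from the description above.

```python
from string import ascii_uppercase

# Placeholder 10-option question (illustrative data, not from the dataset).
options = [f"option {i}" for i in range(10)]

# By the time Doc is built, `choices` names the joined prompt block, a str.
choices = "\n".join(f"{letter}: {opt}" for letter, opt in zip(ascii_uppercase, options))

# Buggy: slices by the character count of the joined string.
buggy = ascii_uppercase[: len(choices)]

# Fixed: slices by the number of options, and yields a list of letters.
fixed = list(ascii_uppercase)[: len(options)]

print(len(choices) > 26)  # the joined block is longer than the alphabet
print(repr(buggy))        # the whole alphabet, as a single str
print(fixed)              # one letter per option
```

Because a slice bound past the end of a sequence is clamped rather than rejected, the buggy expression never raises; it silently returns every letter.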

Fix

Build the answer letters once from len(line["options"]) and use them for both the choices block and Doc.choices, and enumerate the real letters in the instruction:

```python
letters = list(ascii_uppercase)[: len(line["options"])]
choices = "\n".join(f"{letter}: {choice}" for letter, choice in zip(letters, line["options"]))
query = TEMPLATE.format(letters="".join(letters), question=line["question"], choices=choices)
return Doc(..., query=query, choices=letters, gold_index=line["answer_index"], instruction=query)
```

This mirrors the sibling gpqa task (choices=list(ascii_uppercase)[: len(choices)]) and makes Doc.choices a proper list[str]. gold_index alignment is unchanged (choices[gold_index] still yields the correct answer letter).
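
To make the gold_index claim concrete, here is a self-contained sketch of the patched logic. It rests on assumptions: the TEMPLATE text is invented for illustration (the real template in mmlu_pro.py is not shown here), Doc is a small stand-in dataclass rather than lighteval's class, and mmlu_pro_prompt_sketch and the sample line are hypothetical names and data.

```python
from dataclasses import dataclass
from string import ascii_uppercase

# Hypothetical template; the real TEMPLATE in mmlu_pro.py may differ.
TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the form 'Answer: LETTER' where LETTER is one of "
    "{letters}.\n\n{question}\n\n{choices}"
)


@dataclass
class Doc:
    """Stand-in for lighteval's Doc, holding only the fields used here."""

    query: str
    choices: list[str]
    gold_index: int
    instruction: str


def mmlu_pro_prompt_sketch(line: dict) -> Doc:
    # One letter per option, derived from the option list itself.
    letters = list(ascii_uppercase)[: len(line["options"])]
    choices = "\n".join(f"{letter}: {choice}" for letter, choice in zip(letters, line["options"]))
    query = TEMPLATE.format(letters="".join(letters), question=line["question"], choices=choices)
    return Doc(query=query, choices=letters, gold_index=line["answer_index"], instruction=query)


# Illustrative row: 10 options, correct answer at index 6.
line = {
    "question": "Which option is correct?",
    "options": [f"option {i}" for i in range(10)],
    "answer_index": 6,
}
doc = mmlu_pro_prompt_sketch(line)
print(doc.choices)                  # letters A through J
print(doc.choices[doc.gold_index])  # the gold answer letter
```

Since the letters and the option list share the same order and length, indexing choices with the dataset's answer_index picks the letter printed next to the correct option.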

Scoped to the prompt/choices defect; the issue's secondary note about fewshot-format consistency with the official harness is left for a follow-up to keep this PR focused.

How was it tested?

Added tests/unit/tasks/test_mmlu_pro.py, which checks that: for a 10-option question, Doc.choices == list("ABCDEFGHIJ") (not 26 letters) and the instruction says "one of ABCDEFGHIJ" (not "one of ABCD"); the letters track the option count for a 4-option question; gold_index still maps to the right letter; and each option letter appears in the prompt. The two regression assertions fail on main and pass with this change.
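
The shape of those checks can be sketched as follows. This is not the content of tests/unit/tasks/test_mmlu_pro.py, which is not reproduced here: format_prompt is a hypothetical stand-in for the patched prompt function, make_line builds placeholder rows, and the instruction wording is abbreviated.

```python
from string import ascii_uppercase


def format_prompt(line: dict) -> tuple[str, list[str]]:
    """Hypothetical stand-in for the patched function: returns (query, letters)."""
    letters = list(ascii_uppercase)[: len(line["options"])]
    block = "\n".join(f"{letter}: {opt}" for letter, opt in zip(letters, line["options"]))
    query = f"LETTER is one of {''.join(letters)}.\n\n{line['question']}\n\n{block}"
    return query, letters


def make_line(n_options: int, answer_index: int) -> dict:
    """Build a placeholder dataset row with n_options options."""
    options = [f"opt{i}" for i in range(n_options)]
    return {"question": "Q?", "options": options, "answer_index": answer_index}


def test_ten_options_use_a_to_j():
    query, letters = format_prompt(make_line(10, 9))
    assert letters == list("ABCDEFGHIJ")  # not all 26 letters
    assert "one of ABCDEFGHIJ." in query  # not the hardcoded ABCD


def test_letters_track_option_count():
    _, letters = format_prompt(make_line(4, 0))
    assert letters == list("ABCD")


def test_gold_index_maps_to_letter():
    line = make_line(10, 9)
    _, letters = format_prompt(line)
    assert letters[line["answer_index"]] == "J"


def test_each_letter_appears_in_prompt():
    query, letters = format_prompt(make_line(10, 0))
    assert all(f"{letter}: " in query for letter in letters)


if __name__ == "__main__":
    # Run without pytest; under pytest the test_* functions are collected as usual.
    test_ten_options_use_a_to_j()
    test_letters_track_option_count()
    test_gold_index_maps_to_letter()
    test_each_letter_appears_in_prompt()
```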

mmlu_pro_prompt_function had two defects that made MMLU-Pro harder than the
reference harness (a measurable eval-score drop for non-reasoning runs):

1. The instruction hardcoded "where LETTER is one of ABCD", but MMLU-Pro
   questions have up to 10 options (A-J), so the model was told only A-D were
   valid answers.
2. choices was set to ascii_uppercase[: len(choices)], where `choices` was the
   already-joined prompt string (tens of characters), so the slice saturated
   and returned all 26 letters A-Z instead of one letter per option.

Build the answer letters once from len(line["options"]) and use them for both
the choices block and Doc.choices, and enumerate the real letters in the
instruction. This mirrors the sibling gpqa task
(choices=list(ascii_uppercase)[: len(choices)]) and yields choices as a proper
list[str]. gold_index alignment is unchanged.

Scoped to the prompt/choices defect; the issue's secondary note about fewshot
formatting consistency with the official harness is left for a follow-up.

Fixes huggingface#1265
Development

Successfully merging this pull request may close these issues.

MMLU-Pro incorrect prompt format