fix(mmlu_pro): correct answer letters in prompt and choices #1269
Open
vineethsaivs wants to merge 1 commit into
`mmlu_pro_prompt_function` had two defects that made MMLU-Pro harder than the reference harness (a measurable eval-score drop for non-reasoning runs):

1. The instruction hardcoded "where LETTER is one of ABCD", but MMLU-Pro questions have up to 10 options (A-J), so the model was told only A-D were valid answers.
2. `choices` was set to `ascii_uppercase[: len(choices)]`, where `choices` was the already-joined prompt string (tens of characters), so the slice saturated and returned all 26 letters A-Z instead of one letter per option.

Build the answer letters once from `len(line["options"])` and use them for both the choices block and `Doc.choices`, and enumerate the real letters in the instruction. This mirrors the sibling gpqa task (`choices=list(ascii_uppercase)[: len(choices)]`) and yields `choices` as a proper `list[str]`. `gold_index` alignment is unchanged.

Scoped to the prompt/choices defect; the issue's secondary note about fewshot formatting consistency with the official harness is left for a follow-up.

Fixes huggingface#1265
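The second defect can be reproduced in isolation. The following is a minimal sketch, not the code from `mmlu_pro.py`: the option texts and the `LETTER: text` layout of the joined block are assumptions made for illustration. It relies only on the fact that a Python slice with an out-of-range upper bound is clamped to the sequence length rather than raising an error.

```python
from string import ascii_uppercase

# Hypothetical 10-option question standing in for an MMLU-Pro row.
options = [f"option {i}" for i in range(10)]

# Assumed layout: the choices block is one joined string, one line per option.
joined = "\n".join(
    f"{letter}: {text}" for letter, text in zip(ascii_uppercase, options)
)

# Buggy pattern: the bound is the length of the joined string, which is
# longer than 26 characters, so the slice is clamped to the whole alphabet.
buggy_choices = ascii_uppercase[: len(joined)]

# Intended pattern: one letter per option, as a list of strings.
fixed_choices = list(ascii_uppercase)[: len(options)]

print(buggy_choices)
print(fixed_choices)
```

Note that the buggy expression also produces a `str` rather than a `list[str]`, which is why the fix wraps the alphabet in `list(...)` before slicing.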
What does this PR do?
Fixes #1265. The MMLU-Pro task prompts models incorrectly, which makes the benchmark harder than the reference TIGER-AI-Lab harness (the issue reports a measurable score drop for non-reasoning runs).
`mmlu_pro_prompt_function` in `src/lighteval/tasks/tasks/mmlu_pro.py` had two defects:

1. The instruction hardcoded `where LETTER is one of ABCD`. MMLU-Pro questions have up to 10 options (A-J), so the model was told that only A-D were valid answers.
2. `Doc.choices` was set with `choices=ascii_uppercase[: len(choices)]`. Here `choices` is the already-joined prompt string (tens of characters), so `len(choices)` far exceeds 26 and the slice returns the entire alphabet (all 26 letters A-Z) instead of one letter per option. It was meant to be `len(line["options"])`.

Fix
Build the answer letters once from `len(line["options"])` and use them for both the choices block and `Doc.choices`, and enumerate the real letters in the instruction.

This mirrors the sibling `gpqa` task (`choices=list(ascii_uppercase)[: len(choices)]`) and makes `Doc.choices` a proper `list[str]`. `gold_index` alignment is unchanged (`choices[gold_index]` still yields the correct answer letter).

Scoped to the prompt/choices defect; the issue's secondary note about fewshot-format consistency with the official harness is left for a follow-up to keep this PR focused.
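The shape of the fix described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the diff from this PR: `SketchDoc` is a hypothetical stand-in for lighteval's `Doc`, `sketch_prompt` is a hypothetical name, the `question` and `answer_index` keys are assumed dataset fields, and the instruction wording is invented apart from the `one of ...` fragment.

```python
from dataclasses import dataclass
from string import ascii_uppercase


@dataclass
class SketchDoc:
    """Hypothetical stand-in for lighteval's Doc; only the fields used here."""

    query: str
    choices: list[str]
    gold_index: int


def sketch_prompt(line: dict) -> SketchDoc:
    # One letter per option, derived from the option list itself and
    # never from the length of the joined prompt string.
    letters = list(ascii_uppercase)[: len(line["options"])]

    # The same letters label the choices block shown to the model.
    choices_block = "\n".join(
        f"{letter}: {option}" for letter, option in zip(letters, line["options"])
    )

    # Illustrative wording; the instruction enumerates the real letters
    # instead of a hardcoded "ABCD".
    instruction = (
        "Answer with 'Answer: LETTER' where LETTER is one of "
        f"{''.join(letters)}."
    )

    query = f"{instruction}\n\n{line['question']}\n\n{choices_block}"
    # "answer_index" is an assumed field name for the gold option's position.
    return SketchDoc(query=query, choices=letters, gold_index=line["answer_index"])
```

With a 10-option row, `sketch_prompt(row).choices` has exactly ten entries and `choices[gold_index]` is the gold answer letter, which is the alignment property the PR states is unchanged.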
How was it tested?
Added `tests/unit/tasks/test_mmlu_pro.py`, which checks that:

- for a 10-option question, `Doc.choices == list("ABCDEFGHIJ")` (not 26 letters) and the instruction says `one of ABCDEFGHIJ` (not `one of ABCD`);
- letters track the option count for a 4-option question;
- `gold_index` still maps to the right letter;
- each option letter appears in the prompt.

The two regression assertions fail on `main` and pass with this change.