MMLU blog post discussion

#82
by thomwolf HF staff - opened
Open LLM Leaderboard org

This is a discussion page for the blog post exploring the various ways MMLU can be evaluated (in particular for the Falcon and LLaMA models), available at https://huggingface.co/blog/evaluating-mmlu-leaderboard

Is there a script/code to regenerate all the metrics from the blog post? thanks!

Ideally, a good test should be realistic, unambiguous, luckless, and easy to understand. Showing fairness is easier to do by the negative:

  1. If a model passes a question but would never give the right answer when asked in a chat, then the test is not realistic. So HELM’s rejecting an answer if it is not the highest-probability one is reasonable.
  2. If a model sometimes had a high pass rate and sometimes a low one, its result would be ambiguous. So realism should not go all the way to using ordinary sampling such as nucleus sampling. Yet…
  3. If a model passes a question but its answer in a chat would be basically random, then the test is lucky. So the test should account for how close the probabilities of the answers are: if they are all near-equal and the right one is only imperceptibly higher, that should be taken into account.
  4. Besides, if a test result makes it unclear just how bad it is, then it is harder to understand. NeoX’s 25% could be mistaken for an OK score, but it is essentially as good as random guessing among four options.

What if we averaged the probability of the right answer across tasks?

  • The result would be on a clear centigrade scale (0% is bad, 100% is good).
  • Uncertainty between answers (nearby probabilities) would negatively impact the score.
  • It is also clearer, making it less likely that people would implement it differently (apart from the few-shot variations).
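To make the proposal above concrete, here is a minimal sketch of "average probability of the right answer" next to plain accuracy. It assumes each question comes with one log-likelihood per answer choice plus the index of the correct choice; the `examples` format, the function names, and the numbers are all made up for illustration and are not the output format of HELM or the Harness.

```python
import math

def softmax(logits):
    """Turn per-choice log-likelihoods into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_correct_probability(examples):
    """Average the normalized probability given to the correct choice.

    `examples` is a hypothetical list of (loglikelihoods, gold_index) pairs.
    """
    probs = [softmax(lls)[gold] for lls, gold in examples]
    return sum(probs) / len(probs)

def accuracy(examples):
    """Fraction of questions where the top-scoring choice is the correct one."""
    hits = [max(range(len(lls)), key=lls.__getitem__) == gold
            for lls, gold in examples]
    return sum(hits) / len(hits)

# Illustrative, made-up scores for three four-choice questions.
examples = [
    ([-1.00, -1.01, -1.02, -1.03], 0),  # near-tie, correct by a hair
    ([-0.1, -5.0, -5.0, -5.0], 0),      # confident and correct
    ([-0.1, -5.0, -5.0, -5.0], 1),      # confident and wrong
]

print(f"accuracy:                 {accuracy(examples):.3f}")
print(f"mean correct probability: {mean_correct_probability(examples):.3f}")
```

In this toy data the near-tie counts as a full pass for accuracy but contributes only about a quarter to the averaged probability. Note that a model spreading probability evenly over four choices would score 25% on this metric too, so chance level is still not 0%.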

I see that models such as EleutherAI/gpt-neox-20b do well when evaluated with HELM (Harness), and almost all of the next models follow the same trend. This means the models are good at predicting the probabilities of the whole answer rather than the option (from what I understand from the article). Is there any reason for that? I find it quite interesting.

There's a spelling error for the word 'implementation'. Didn't catch anything else. Good article! :)

"MMLU comes in all shapes and sizes: Looking at the prompts
Let’s compare an example of prompt each benchmark sends to the models by each implmentation for the same MMLU dataset example:"

Great article! We have experienced something similar while developing InstructEval (https://declare-lab.net/instruct-eval/). Codes are here: https://github.com/declare-lab/instruct-eval

In your detailed number ranking, with the original MMLU implementation, llama30B is better than falcon40B, so in the map it should be #2, not #3.

I now see HELM as a broken evaluation. Indeed, most LLMs tend to respond in a conversational tone, so it's bizarre to expect the first generated token to be a choice number.

Another way to select the answer from the output of LLMs would be via kNN: generate a text from the LLM, then find the answer choice that is closest to it.
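A rough sketch of that nearest-answer idea, under loud assumptions: it picks the single closest choice (1-nearest neighbour) and uses character-level similarity from Python's standard `difflib` as a stand-in for whatever distance would actually be used (an embedding model would be the more likely choice in practice). The function name `closest_choice` and the example question are hypothetical.

```python
from difflib import SequenceMatcher

def closest_choice(generated, choices):
    """Return the index of the answer choice most similar to the generated text.

    Similarity here is difflib's character-level ratio, used only as a
    simple stand-in for a real distance such as embedding similarity.
    """
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    scores = [similarity(generated, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)

# Hypothetical multiple-choice question and free-form model output.
choices = ["Paris", "London", "Berlin", "Madrid"]
generated = "The capital of France is Paris."

best = closest_choice(generated, choices)
print(choices[best])
```

String overlap is crude: a conversational reply that mentions several choices, or paraphrases the right one, could easily be matched to the wrong answer, which is why a semantic distance would probably be needed for real use.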

Open LLM Leaderboard org
edited Jul 19, 2023

@Linbo Yes, the Llama 2 models' scores are completely correct and should be reproducible using the Harness.
They were launched after the debugging at the end of last week :)
(Plus, people at Meta told us we were "in range" ^^)

What happens to the models that have wrong scores? Will they be re-evaluated? Does that happen automatically? Do they have to be submitted again?

Open LLM Leaderboard org

We are re-running all the llama-based models as we speak. However, if you fear that your model is not being re-run, please open an issue and tag @SaylorTwift and me, and we'll take care of it asap.

@clefourrier For the MMLU score reported on the leaderboard, the reproducibility section says that it's the acc of all, but doesn't indicate whether that accuracy is arrived at by doing an average of the individual task accuracies, or a weighted average based on the total number of items per task, or something else... Which of these is Hugging Face doing, please? (If there's a code repo for the leaderboard somewhere that I could be looking at instead of asking these questions, please point me there!)

It looks like the original MMLU code (https://github.com/hendrycks/test/blob/master/evaluate.py) does a weighted average of the items within each subject area (across the tasks grouped within that subject area), but given that the number of items isn't part of the lm-evaluation-harness output for the hendrycksTest tasks, it seems less obvious how to weight the results for the various tasks using that output.

Thanks for any insight into this that you can share!

Open LLM Leaderboard org

@emilyva The score is just a simple average of individual task accuracies :)

@clefourrier That was my guess (especially after comparing that result to the leaderboard published result for one of the models), but thanks for confirming!
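For anyone comparing numbers, here is a small sketch of how the two averaging schemes discussed above can differ. The task names, accuracies, and item counts are invented for illustration; they are not real MMLU figures, and the function names are not taken from the leaderboard or Harness code.

```python
def simple_average(task_acc):
    """Unweighted mean of per-task accuracies (every task counts equally)."""
    return sum(task_acc.values()) / len(task_acc)

def weighted_average(task_acc, task_sizes):
    """Mean of per-task accuracies weighted by the number of items per task."""
    total_items = sum(task_sizes[task] for task in task_acc)
    weighted_sum = sum(task_acc[task] * task_sizes[task] for task in task_acc)
    return weighted_sum / total_items

# Made-up per-task accuracies and item counts.
task_acc = {"task_a": 0.80, "task_b": 0.40}
task_sizes = {"task_a": 100, "task_b": 300}

print(f"simple average:   {simple_average(task_acc):.2f}")
print(f"weighted average: {weighted_average(task_acc, task_sizes):.2f}")
```

With these made-up numbers the simple average is 0.60 while the item-weighted average is 0.50, so the two schemes only agree when tasks have equal sizes or equal accuracies.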

Open LLM Leaderboard org

Closing due to inactivity (but it's linked in the resources tab for archival purposes)

clefourrier changed discussion status to closed

Harness eval performs badly when the correct answer is short.
