Suggestion: Adding outlier-resistant averaging methods

#968
by zelk12 - opened

Add an option for outputting model parameters that takes exploding values into account (very large values in one of the columns). This would make it possible to find models that are, on average, equally capable of solving all of the problems presented in the tests here.

image.png

image.png

Link to the verification .xlsx file in Google Drive.

zelk12 changed discussion title from suggestion: Additional option for outputting indicators to Suggestion: Additional option for outputting indicators
Open LLM Leaderboard org

Hi @zelk12 ,

Thank you for your suggestion! Do I understand correctly that you would like to add colour differentiation of results on the Leaderboard? Or which model parameters do you want to see?

Hello. This is more about how the results are calculated. In general, if we look at Figure 1, we can see that the first model should be in second place by average1, but by average2 it is in first place.

image.png

image.png

image.png

The second half is calculated so that values that differ greatly from the rest of the results are taken into account.

As an example, I will set one of the parameters of the third model to 9,000.

image.png

Here we can see that in the first table, due to the calculation of the mean, model 3 is in the lead, but in the second table it can only rank second.

The same is true if we set two of the model's parameters to 9,000.

image.png

Only when we set three of its parameters to 9,000 does model 3 also rank first by average in the second table.

image.png

Something like that. Unfortunately, I'm not very good at explaining things.

Open LLM Leaderboard org

I think I got your idea, thank you!
You're pointing out that the current method of calculating averages doesn't account for extreme values in one or more columns, which can skew the results. So the goal of harmonising the average score is to find models that perform well across all tasks, rather than letting outliers dominate the average score.

This idea makes sense. We need to discuss it internally, and I will get back to you with my answer.
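To illustrate the skew described above, here is a minimal sketch. The thread does not state which formula average2 uses, so the harmonic mean is used purely as an assumed stand-in for an outlier-resistant average. The model names and scores are made up, apart from the 9,000 value borrowed from the example earlier in the discussion.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-benchmark scores, invented for illustration only.
scores = {
    "model_a": [50, 50, 50, 50],      # consistent across all benchmarks
    "model_b": [10, 10, 10, 9000],    # weak overall, with one exploding value
}


def rank(scores, average):
    """Return model names sorted from best to worst under the given average."""
    return sorted(scores, key=lambda name: average(scores[name]), reverse=True)


# The arithmetic mean is dominated by the single 9,000 value.
arithmetic_ranking = rank(scores, mean)

# The harmonic mean (an assumed alternative) is pulled towards the low values.
harmonic_ranking = rank(scores, harmonic_mean)

print("arithmetic:", arithmetic_ranking)
print("harmonic:  ", harmonic_ranking)
```

Under these made-up numbers the two averages order the models differently, which is the kind of flip the tables in this discussion show.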

Open LLM Leaderboard org

Let me rename the discussion. Feel free to correct me.

alozowski changed discussion title from Suggestion: Additional option for outputting indicators to Suggestion: Adding outlier-resistant averaging methods

Ok. And yes, you probably have the right definition.

Open LLM Leaderboard org

I'm back with our thoughts: we've decided to keep our current arithmetic mean approach because it is simple and widely understood. Plus, since we're currently normalising the scores, we're already mitigating the outlier effect.

Nevertheless, I will keep your approach in mind and might get back to it later.

Let me close this discussion for now. We greatly appreciate your involvement! Please feel free to share any of your ideas here in discussions, and don't hesitate to ask questions if you run into any problems!
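As a rough illustration of why normalisation limits the outlier effect, here is a sketch that assumes a min-max style rescale to the 0 to 100 range with clamping. The function name, the bounds, and the clamping are assumptions made for this example, not the leaderboard's actual code.

```python
def normalise(raw, lower, upper):
    """Rescale a raw score to 0..100 between lower and upper, clamping the result.

    `lower` could be a random-guess baseline and `upper` the maximum possible
    score. Both are hypothetical parameters for this sketch.
    """
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    scaled = (raw - lower) / (upper - lower) * 100.0
    return max(0.0, min(100.0, scaled))


# Made-up raw scores; the last one is an exploding value.
raw_scores = [0.30, 0.40, 0.35, 9000.0]
normalised = [normalise(s, lower=0.25, upper=1.0) for s in raw_scores]
average = sum(normalised) / len(normalised)
print(normalised, average)
```

Because every normalised score is bounded, a single benchmark can shift the arithmetic mean of n benchmarks by at most 100 / n points under this assumed scheme.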

alozowski changed discussion status to closed

As an option, this approach could be added as a separate column.

Open LLM Leaderboard org

Yes, we discussed it as a separate column, but the logic remains the same as I've described above.

If anyone here wants to suppress only over-fitting outliers (scores that are too high) without removing low-score outliers, then maybe some mix of the geometric mean and the harmonic mean would work.

Alternatively, averaging based on the odds ratio (truncated at extremes near 1 or 0 to avoid problems) may preserve the significance of results such as 0.99 vs 0.95 vs 0.9 and 0.01 vs 0.03 vs 0.1.
Maybe a weighted average of these is an option?
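One possible reading of the odds-based idea above is to average the scores in log-odds (logit) space and map the result back to a probability. This interpretation, the function name `logit_mean`, and the truncation threshold `eps` are all assumptions made for this sketch.

```python
import math


def logit_mean(scores, weights=None, eps=1e-3):
    """Weighted average of scores in (0, 1), computed in log-odds space.

    Scores are truncated to [eps, 1 - eps] so that values at or near
    0 and 1 do not produce infinite log-odds.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    total_weight = sum(weights)
    weighted_logits = 0.0
    for p, w in zip(scores, weights):
        p = min(max(p, eps), 1.0 - eps)            # truncate extremes
        weighted_logits += w * math.log(p / (1.0 - p))
    avg_logit = weighted_logits / total_weight
    return 1.0 / (1.0 + math.exp(-avg_logit))      # back to a probability


print(logit_mean([0.99, 0.90]))   # higher than the arithmetic mean of 0.945
print(logit_mean([0.90, 0.10]))   # symmetric scores cancel out to about 0.5
```

In log-odds space the step from 0.95 to 0.99 counts for more than the step from 0.50 to 0.54, which matches the point about keeping the significance of results near the extremes.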

Open LLM Leaderboard org

Hi!
We try to avoid adding too many options to the leaderboard to keep it usable by the majority of people. If you want to compute your own custom geometric/harmonic means on the results, you can do so by downloading the contents here: https://huggingface.co/datasets/open-llm-leaderboard/contents/tree/main
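Once the results are downloaded, a custom average can be computed per model with the standard library alone. The sketch below assumes the rows have already been loaded as a list of dictionaries. The column names (`model`, `benchmark_a`, `benchmark_b`) are hypothetical placeholders, not the dataset's real column names.

```python
from statistics import harmonic_mean


def add_harmonic_mean(rows, columns, key="harmonic_mean"):
    """Return copies of the rows with an extra harmonic-mean column.

    `rows` is a list of dicts and `columns` names the score fields to
    average. Scores must be non-negative for the harmonic mean.
    """
    result = []
    for row in rows:
        values = [float(row[c]) for c in columns]
        result.append({**row, key: harmonic_mean(values)})
    return result


# Made-up rows standing in for the downloaded leaderboard contents.
rows = [
    {"model": "model_a", "benchmark_a": 40.0, "benchmark_b": 60.0},
    {"model": "model_b", "benchmark_a": 10.0, "benchmark_b": 90.0},
]
ranked = sorted(
    add_harmonic_mean(rows, ["benchmark_a", "benchmark_b"]),
    key=lambda r: r["harmonic_mean"],
    reverse=True,
)
for r in ranked:
    print(r["model"], round(r["harmonic_mean"], 2))
```

Both made-up models have the same arithmetic mean here, but the harmonic mean ranks the more balanced one first.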
