LLM Benchmarks and Data Leakage

#1
by dvamvour - opened

The DeepSeek paper does not say anything about the datasets used for the RL training phase or the exact rules used to calculate the reward for a solution proposed by the policy. If hype and quality are measured mostly on benchmarks like AIME, MATH500, Codeforces, etc., a question that arises is: what prevents a company from using a good amount of this benchmark data during training as well as testing, thereby hacking its performance on these benchmarks? Even without direct data leakage, many of these questions could be replicated with slight modifications and used as training data, artificially inflating the model's performance on these benchmarks. Is there any way of verifying this apart from good faith?
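
For concreteness, one black-box check I could imagine (just a rough sketch, not anything described in the paper; `query_model` and `paraphrase` are hypothetical placeholders) would compare accuracy on the original benchmark items against lightly paraphrased versions, since a large gap would hint at memorization:

```python
# Rough sketch of a black-box memorization probe; NOT from the DeepSeek paper.
# `query_model` and `paraphrase` are hypothetical placeholders for whatever
# inference endpoint and rewriting step is actually available.

def query_model(question: str) -> str:
    """Placeholder: call the model under test and return its final answer."""
    raise NotImplementedError

def paraphrase(question: str) -> str:
    """Placeholder: lightly reword the question without changing its substance."""
    raise NotImplementedError

def accuracy(items, transform=lambda q: q) -> float:
    """Fraction of (question, gold_answer) pairs the model answers correctly."""
    correct = sum(
        query_model(transform(q)).strip() == gold.strip() for q, gold in items
    )
    return correct / len(items)

def memorization_gap(benchmark_items) -> float:
    """A large (original - paraphrased) accuracy gap is a contamination red flag."""
    return accuracy(benchmark_items) - accuracy(benchmark_items, transform=paraphrase)
```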

Open R1 org

Yes, that's an interesting question for sure. We will try to do it, and of course we welcome contributions on this!

Yeah, it would be interesting to do an ablation study showing how performance on these benchmarks changes as more samples from the same benchmarks are used for training.
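
A minimal sketch of the data side of such an ablation (assuming `clean_train` and `benchmark_items` are simply lists of examples; the fine-tuning and evaluation loop would depend on the setup):

```python
import random

def build_mixture(clean_train, benchmark_items, contamination_rate, seed=0):
    """Mix `contamination_rate` (0.0-1.0) of the benchmark items into the
    clean training data; returns the mixture and the leaked subset so the
    remaining items can still serve as a truly held-out split."""
    rng = random.Random(seed)
    k = int(contamination_rate * len(benchmark_items))
    leaked = rng.sample(benchmark_items, k)
    mixture = list(clean_train) + list(leaked)
    rng.shuffle(mixture)
    return mixture, leaked

# One fine-tuning + evaluation run per contamination rate, e.g.:
# for rate in [0.0, 0.01, 0.05, 0.25, 1.0]:
#     train_set, leaked = build_mixture(clean_train, benchmark_items, rate)
#     ...fine-tune on train_set, then report accuracy on the full benchmark...
```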

We could do some contamination analysis of DeepSeek-R1; that should give us some idea.
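
As a starting point, a simple n-gram overlap check between the benchmark questions and whatever training-corpus proxy we can assemble (we obviously don't have DeepSeek's actual training data, so this is only a sketch) could look like this:

```python
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Lower-cased word n-grams; 13 is the window used in some published
    decontamination checks, but the choice is somewhat arbitrary."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(corpus: Iterable[str], n: int = 13) -> Set[Tuple[str, ...]]:
    """Union of n-grams over every document in the (proxy) training corpus."""
    index: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index

def flag_contaminated(benchmark_items: List[str], index: Set[Tuple[str, ...]],
                      n: int = 13) -> List[int]:
    """Indices of benchmark items sharing at least one n-gram with the corpus."""
    return [i for i, item in enumerate(benchmark_items) if ngrams(item, n) & index]
```

Exact n-gram matches will miss the "slightly modified" copies mentioned above, so fuzzier matching (e.g. normalized edit distance or embedding similarity) would be a natural extension.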
