LLM Benchmarks and Data Leakage

#1
by dvamvour - opened

The DeepSeek paper does not say anything about the datasets used for the RL training phase or the exact rules used to calculate the reward for a solution proposed by the policy. If hype and quality are measured mostly on benchmarks like AIME, MATH500, Codeforces, etc., a question that arises is: what prevents a company from using a good amount of this benchmark data during training as well as testing, thereby hacking its performance on these benchmarks? Even without direct data leakage, many of these questions could be replicated with slight modifications and used as training data, artificially inflating the model's performance on these benchmarks. Is there any way of verifying this apart from good faith?
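
For concreteness, one black-box check I could imagine (just a rough sketch, not anything described in the paper; `query_model` and `paraphrase` are hypothetical placeholders) would compare accuracy on the original benchmark items against lightly paraphrased versions, since a large gap would hint at memorization:

```python
# Rough sketch of a black-box memorization probe; NOT from the DeepSeek paper.
# `query_model` and `paraphrase` are hypothetical placeholders for whatever
# inference endpoint and rewriting step is actually available.

def query_model(question: str) -> str:
    """Placeholder: call the model under test and return its final answer."""
    raise NotImplementedError

def paraphrase(question: str) -> str:
    """Placeholder: lightly reword the question without changing its substance."""
    raise NotImplementedError

def accuracy(items, transform=lambda q: q) -> float:
    """Fraction of (question, gold_answer) pairs the model answers correctly."""
    correct = sum(
        query_model(transform(q)).strip() == gold.strip() for q, gold in items
    )
    return correct / len(items)

def memorization_gap(benchmark_items) -> float:
    """A large (original - paraphrased) accuracy gap is a contamination red flag."""
    return accuracy(benchmark_items) - accuracy(benchmark_items, transform=paraphrase)
```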

Open R1 org

Yes, that's an interesting question for sure. We will try to do it, and of course we welcome contributions on this!

Yeah, it would be interesting to do an ablation study showing how performance on these benchmarks changes as more samples from the same benchmarks are used for training.
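
A minimal sketch of the data side of such an ablation (assuming `clean_train` and `benchmark_items` are simply lists of examples; the fine-tuning and evaluation loop would depend on the setup):

```python
import random

def build_mixture(clean_train, benchmark_items, contamination_rate, seed=0):
    """Mix `contamination_rate` (0.0-1.0) of the benchmark items into the
    clean training data; returns the mixture and the leaked subset so the
    remaining items can still serve as a truly held-out split."""
    rng = random.Random(seed)
    k = int(contamination_rate * len(benchmark_items))
    leaked = rng.sample(benchmark_items, k)
    mixture = list(clean_train) + list(leaked)
    rng.shuffle(mixture)
    return mixture, leaked

# One fine-tuning + evaluation run per contamination rate, e.g.:
# for rate in [0.0, 0.01, 0.05, 0.25, 1.0]:
#     train_set, leaked = build_mixture(clean_train, benchmark_items, rate)
#     ...fine-tune on train_set, then report accuracy on the full benchmark...
```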

We could do some contamination analysis of DeepSeek-R1; that should give us some idea.
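
As a starting point, a simple n-gram overlap check between the benchmark questions and whatever training-corpus proxy we can assemble (we obviously don't have DeepSeek's actual training data, so this is only a sketch) could look like this:

```python
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Lower-cased word n-grams; 13 is the window used in some published
    decontamination checks, but the choice is somewhat arbitrary."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(corpus: Iterable[str], n: int = 13) -> Set[Tuple[str, ...]]:
    """Union of n-grams over every document in the (proxy) training corpus."""
    index: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index

def flag_contaminated(benchmark_items: List[str], index: Set[Tuple[str, ...]],
                      n: int = 13) -> List[int]:
    """Indices of benchmark items sharing at least one n-gram with the corpus."""
    return [i for i, item in enumerate(benchmark_items) if ngrams(item, n) & index]
```

Exact n-gram matches will miss the "slightly modified" copies mentioned above, so fuzzier matching (e.g. normalized edit distance or embedding similarity) would be a natural extension.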
