arxiv:2502.13820

Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning

Published on Feb 19
Abstract

Code verification has recently found great success as a critical component in training large-scale reasoning models for coding. Synthetic techniques such as self-generated test cases and reward models provide a way to enhance code capabilities beyond predefined tests. Building on these advancements, we propose new benchmarks designed to systematically evaluate the impact of synthetic verification methods on assessing solution correctness. We introduce HE-R, HE-R+, MBPP-R, and MBPP-R+, which transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. Using these benchmarks, we analyze synthetic verification methods in standard, reasoning-based, and reward-based LLMs. Our results show that recent reasoning models significantly improve test case generation and that scaling test cases enhances verification accuracy.
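To make the idea concrete, here is a minimal illustrative sketch (not the paper's code) of synthetic test-case verification as described in the abstract: each candidate solution is scored by the fraction of generated test cases it passes, and candidates are ranked by that score. All function and variable names (`passes`, `pass_rate`, `rank_solutions`, the toy `add` examples) are assumptions for illustration; a real pipeline would sandbox execution and use model-generated tests.

```python
# Hypothetical sketch: score candidate solutions by the fraction of synthetic
# test cases they pass, then rank them. Names and structure are illustrative.
from typing import List


def passes(solution_code: str, test_case: str) -> bool:
    """Run one assert-style test case against a candidate solution in a scratch namespace."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)   # define the candidate function
        exec(test_case, namespace)       # e.g. "assert add(2, 3) == 5"
        return True
    except Exception:
        return False


def pass_rate(solution_code: str, test_cases: List[str]) -> float:
    """Score = fraction of synthetic test cases the solution passes."""
    if not test_cases:
        return 0.0
    return sum(passes(solution_code, t) for t in test_cases) / len(test_cases)


def rank_solutions(solutions: List[str], test_cases: List[str]) -> List[str]:
    """Rank candidate solutions from highest to lowest pass rate."""
    return sorted(solutions, key=lambda s: pass_rate(s, test_cases), reverse=True)


if __name__ == "__main__":
    candidates = [
        "def add(a, b):\n    return a + b",   # correct
        "def add(a, b):\n    return a - b",   # buggy
    ]
    synthetic_tests = [
        "assert add(2, 3) == 5",
        "assert add(-1, 1) == 0",
        "assert add(0, 0) == 0",
    ]
    for sol in rank_solutions(candidates, synthetic_tests):
        print(f"pass rate {pass_rate(sol, synthetic_tests):.2f}:\n{sol}\n")
```

Under this framing, the paper's observation that scaling test cases enhances verification accuracy corresponds to the pass-rate signal becoming a more reliable estimate of correctness as more (and more diverse) synthetic tests are added.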
