LR\({}^{2}\)Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

LR\({}^{2}\)Bench is a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR\({}^{2}\)Bench comprises 850 samples across six types of Constraint Satisfaction Problems (CSPs), in which reflective reasoning is crucial for deriving solutions that satisfy all given constraints. Each task type focuses on a distinct constraint pattern, such as knowledge-based, logical, or spatial constraints, providing a comprehensive evaluation across diverse problem-solving scenarios.
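For intuition, a CSP instance of this kind can be viewed as a set of variables together with constraints that a candidate solution must jointly satisfy. The sketch below is a minimal, hypothetical illustration of verifying a candidate assignment against such constraints; the variable names and constraints are assumptions for illustration only and do not reflect the benchmark's actual schema or evaluation code.

```python
# Illustrative sketch (not LR^2Bench's official evaluation code): a candidate
# assignment is accepted only if it satisfies every constraint, mirroring the
# requirement that solutions meet all given constraints.
from typing import Callable, Dict, List

Assignment = Dict[str, int]
Constraint = Callable[[Assignment], bool]

def satisfies_all(assignment: Assignment, constraints: List[Constraint]) -> bool:
    """Return True only if the assignment satisfies every constraint."""
    return all(constraint(assignment) for constraint in constraints)

# Hypothetical toy instance mixing logical and range-style constraints.
constraints: List[Constraint] = [
    lambda a: a["x"] != a["y"],           # logical: x and y must differ
    lambda a: a["x"] + a["y"] == a["z"],  # relational: z is the sum of x and y
    lambda a: 0 <= a["z"] <= 9,           # bound: z must lie in a fixed range
]

candidate = {"x": 2, "y": 3, "z": 5}
print(satisfies_all(candidate, constraints))  # True: all constraints hold
```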

Note: We have released the LR\({}^{2}\)Bench dataset here. If you would like your model's performance evaluated or have any questions, please feel free to contact us at chenjianghao2022@ia.ac.cn. We will soon provide an automated evaluation system on the leaderboard website here.