Scaling test-time compute 📈 Enhance math problem solving by scaling test-time compute
TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring Paper • 2403.15879 • Published Mar 23, 2024 • 1
DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents Paper • 2406.13144 • Published Jun 19, 2024 • 11