
Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

Guiyao Tie1, Zenghui Yuan1, Zeli Zhao1, Chaoran Hu1, Tianhe Gu1, Ruihang Zhang1, Sizhe Zhang1, Junran Wu1, Xiaoyue Tu1, Ming Jin2, Qingsong Wen3, Lixing Chen4, Pan Zhou1, Lichao Sun5
1Huazhong University of Science and Technology, 2Griffith University, 3Squirrel Ai Learning, 4Shanghai Jiao Tong University, 5Lehigh University

This paper introduces CorrectBench, the first systematic benchmark for evaluating self-correction capabilities in large language models (LLMs) across three critical domains: commonsense reasoning, mathematical reasoning, and code generation. We analyze three distinct correction paradigms: (1) intrinsic self-correction through reflective prompting; (2) external verification via tool-augmented methods; and (3) fine-tuned correction models. Our empirical study demonstrates that while self-correction strategies achieve 5.2% accuracy gains on complex mathematical reasoning (MATH dataset), hybrid approaches incur significant efficiency costs (≈40% slower than baseline). Surprisingly, chain-of-thought prompting remains competitive while executing 2.8× faster than DeepSeek-V3 with correction layers. These findings highlight the critical trade-off between reasoning accuracy and computational efficiency in LLM self-correction systems.

Introduction

The rapid advancement of large language models (LLMs), exemplified by GPT-3.5 and LLaMA 3, has precipitated a transformative shift in artificial intelligence (AI), yielding state-of-the-art performance across diverse tasks. These tasks include content generation, natural language understanding, and complex decision-making, all of which have been revolutionized by the extensive pretraining and sophisticated architectures of LLMs. Notably, the introduction of frameworks such as Chain-of-Thought (CoT) prompting has further expanded LLMs' capacity for multi-step reasoning, enabling them to tackle more intricate tasks. Despite these advancements, ensuring the reliability and accuracy of model outputs, especially for reasoning-intensive tasks, remains a formidable challenge. In response, recent works have focused on self-correction strategies aimed at refining LLMs' decision-making processes through iterative revision. Pioneering approaches such as RARR, Refiner, and CRITIC illustrate the potential of integrating feedback loops and corrective components into model architectures. However, these approaches often yield inconsistent gains across different tasks, raising deeper questions about their correction capability and generalizability.

What is CorrectBench?

CorrectBench is a systematically designed benchmark that quantifies the extent to which various correction methods improve model outputs in reasoning-intensive scenarios. CorrectBench characterizes self-correction along three principal dimensions: Task Scenario, Self-Correction Type, and LLM Type. The evaluation pipeline begins with selecting a specific task scenario and dataset, followed by applying a chosen correction method, and concludes with assessing the model's iterative self-correction process across diverse LLMs.
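In code, this pipeline can be pictured as a loop over (task scenario, dataset, correction method, LLM) configurations. The sketch below only illustrates that flow under assumed interfaces; the callables generate, refine, and is_correct, and the function name evaluate_configuration, are hypothetical and are not part of the benchmark's released API.

```python
from typing import Callable, Iterable, Tuple

def evaluate_configuration(
    examples: Iterable[Tuple[str, str]],      # (prompt, reference answer) pairs from the chosen dataset
    generate: Callable[[str], str],           # initial LLM call: prompt -> first-attempt answer
    refine: Callable[[str, str], str],        # one correction step: (prompt, current answer) -> revised answer
    is_correct: Callable[[str, str], bool],   # task-specific checker (exact match, unit tests, ...)
    max_rounds: int = 3,                      # cap on iterative self-correction rounds
) -> float:
    """Accuracy of one (task scenario, dataset, correction method, LLM) configuration."""
    examples = list(examples)
    num_correct = 0
    for prompt, reference in examples:
        answer = generate(prompt)             # step 1: initial attempt
        for _ in range(max_rounds):           # step 2: iterative self-correction
            revised = refine(prompt, answer)
            if revised == answer:             # converged: the method proposes no further change
                break
            answer = revised
        num_correct += is_correct(answer, reference)  # step 3: assess the final answer
    return num_correct / len(examples)
```

Swapping in a different refine callable corresponds to switching the correction method while keeping the task, dataset, and model fixed.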

Data Overview

Self-Correction Types

S1 (Intrinsic Correction): LLMs internally identify and correct errors without external tools, relying on self-evaluation of their own reasoning steps; a minimal sketch of this loop appears after this list.

S2 (External Correction): Leverages external resources such as knowledge bases and search tools to validate information.

S3 (Fine-tuned Correction): Enhances performance through targeted fine-tuning and requires training with specialized datasets; representative methods include DCoT, SCORE, and SuperCorrect.
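As a concrete illustration of the intrinsic (S1) setting, the sketch below shows a generic critique-then-revise loop in the spirit of methods such as RCI and Self-Refine. The prompt templates, stopping rule, and function names are assumptions made for illustration, not the exact prompts or code of any listed method.

```python
from typing import Callable

def intrinsic_self_correct(llm: Callable[[str], str], question: str, max_rounds: int = 2) -> str:
    """Generic critique-then-revise loop (illustrative prompts, not any method's exact templates)."""
    answer = llm(f"Question: {question}\nAnswer step by step.")
    for _ in range(max_rounds):
        critique = llm(
            f"Question: {question}\nProposed answer: {answer}\n"
            "List any mistakes in the reasoning, or reply 'NO ISSUES' if it is correct."
        )
        if "NO ISSUES" in critique.upper():   # the model judges its own answer to be fine
            return answer
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the issues identified above."
        )
    return answer
```

External (S2) methods follow the same loop but ground the critique in retrieved evidence or tool outputs, while fine-tuned (S3) methods train the model itself to produce the critique and revision.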

Leaderboard

| Type | Method        | HotpotQA (↑)   | CS-QA (↑)      | GPQA (↑)       | GSM8K (↑)      | AQUA (↑)       | MATH (↑)       | HumanEval (↑)  |
|------|---------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| -    | Base          | 80.76          | 79.96          | 18.56          | 86.46          | 61.23          | 75.12          | 72.71          |
| -    | CoT           | 83.29 (+2.53)  | 78.03 (-1.93)  | 16.52 (-2.04)  | 91.96 (+5.50)  | 60.24 (-0.99)  | 72.59 (-2.53)  | 60.10 (-12.61) |
| S1   | RCI           | 79.67 (-1.09)  | 76.29 (-3.67)  | 19.98 (+1.42)  | 87.00 (+0.54)  | 67.12 (+5.89)  | 74.92 (-0.20)  | 67.46 (-5.25)  |
| S1   | CoVe          | 83.04 (+2.28)  | 78.54 (-1.42)  | 37.41 (+18.85) | 92.23 (+5.77)  | 71.12 (+9.89)  | 79.30 (+4.18)  | 76.96 (+4.25)  |
| S1   | Self-Refine   | 85.49 (+4.73)  | 81.06 (+1.10)  | 40.69 (+22.13) | 91.74 (+5.28)  | 69.46 (+8.23)  | 81.77 (+6.65)  | -              |
| S1   | Reflection-v1 | 69.52 (-11.24) | 63.89 (-16.07) | 19.25 (+0.69)  | 67.64 (-18.82) | 48.33 (-12.90) | 65.01 (-10.11) | -              |
| S2   | Reflection-v2 | 87.98 (+7.22)  | 82.21 (+2.25)  | 26.85 (+8.29)  | 89.87 (+3.41)  | 68.23 (+7.00)  | 81.36 (+6.24)  | -              |
| S2   | RARR          | 85.47 (+4.71)  | 80.57 (+0.61)  | 36.82 (+18.26) | 88.92 (+2.46)  | 66.81 (+5.58)  | 82.78 (+7.66)  | 77.35 (+4.64)  |
| S2   | RATT          | 79.59 (-1.17)  | 80.81 (+0.85)  | 25.90 (+7.34)  | 88.08 (+1.62)  | 68.06 (+6.83)  | 80.74 (+5.62)  | 73.44 (+0.73)  |
| S2   | CRITIC        | -              | 81.77 (+1.81)  | -              | 77.46 (-9.00)  | -              | -              | -              |
| -    | Average       | 83.54 (+2.78)  | 80.18 (+0.22)  | 31.28 (+12.72) | 85.04 (-1.42)  | 68.47 (+7.24)  | 80.15 (+5.03)  | 73.80 (+1.09)  |

Values in parentheses denote the change relative to the Base row; "-" marks configurations without a reported result.

Evaluation and Results

Findings indicate substantial accuracy improvements from self-correction, especially on mathematical and complex reasoning tasks. However, combining correction methods can add significant computational overhead: hybrid approaches run roughly 40% slower than the baseline.
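The efficiency side of this trade-off can be quantified by timing each configuration alongside its accuracy. The snippet below is an illustrative measurement harness under assumed interfaces (accuracy_and_latency and relative_overhead are hypothetical names), not the benchmark's actual tooling.

```python
import time
from typing import Callable, Iterable, Tuple

def accuracy_and_latency(
    examples: Iterable[Tuple[str, str]],    # (prompt, reference) pairs
    run_method: Callable[[str], str],       # full pipeline for one prompt, including any correction rounds
    is_correct: Callable[[str, str], bool],
) -> Tuple[float, float]:
    """Return (accuracy, mean seconds per example) for one method on one dataset."""
    examples = list(examples)
    num_correct = 0
    start = time.perf_counter()
    for prompt, reference in examples:
        num_correct += is_correct(run_method(prompt), reference)
    elapsed = time.perf_counter() - start
    return num_correct / len(examples), elapsed / len(examples)

def relative_overhead(method_latency: float, baseline_latency: float) -> float:
    """Fractional slowdown versus the baseline; 0.4 corresponds to '~40% slower'."""
    return method_latency / baseline_latency - 1.0
```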

Evaluation Results

Conclusion and Implications

The benchmark emphasizes self-correction as a valuable tool for enhancing LLM performance, with a clear need to balance reasoning capabilities and efficiency. Future research should optimize this trade-off for practical applications.


BibTeX

@inproceedings{tie2025correctbench,
  title={CorrectBench: A Benchmark of Self-Correction in LLMs},
  author={Guiyao Tie and Zenghui Yuan and Zeli Zhao and Chaoran Hu and Tianhe Gu and Ruihang Zhang and Sizhe Zhang and Junran Wu and Xiaoyue Tu and Ming Jin and Qingsong Wen and Lixing Chen and Pan Zhou and Lichao Sun},
  booktitle={Proceedings of the NeurIPS 2025 Datasets and Benchmarks Track},
  year={2025},
}