
Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

Guiyao Tie1, Zenghui Yuan1, Zeli Zhao1, Chaoran Hu1, Tianhe Gu1, Ruihang Zhang1, Sizhe Zhang1, Junran Wu1, Xiaoyue Tu1, Ming Jin2, Qingsong Wen3, Lixing Chen4, Pan Zhou1, Lichao Sun5
1Huazhong University of Science and Technology, 2Griffith University, 3Squirrel Ai Learning, 4Shanghai Jiao Tong University, 5Lehigh University

This paper introduces CorrectBench, the first systematic benchmark for evaluating self-correction capabilities in large language models (LLMs) across three critical domains: commonsense reasoning, mathematical reasoning, and code generation. We analyze three distinct correction paradigms: (1) intrinsic self-correction through reflective prompting; (2) external verification via tool-augmented methods; and (3) fine-tuned correction models. Our empirical study demonstrates that while self-correction strategies achieve 5.2% accuracy gains on complex mathematical reasoning (MATH dataset), hybrid approaches incur significant efficiency costs (≈40% slower than baseline). Surprisingly, chain-of-thought prompting remains competitive while executing 2.8× faster than DeepSeek-V3 with correction layers. These findings highlight the critical trade-off between reasoning accuracy and computational efficiency in LLM self-correction systems.

Introduction

The rapid advancement of large language models (LLMs), exemplified by GPT-3.5 and LLaMA 3, has precipitated a transformative shift in artificial intelligence (AI), yielding state-of-the-art performance across diverse tasks. These tasks include content generation, natural language understanding, and complex decision-making, all of which have been revolutionized by the extensive pretraining and sophisticated architectures of LLMs. Notably, the introduction of frameworks such as Chain-of-Thought (CoT) prompting has further expanded LLMs' capacity for multi-step reasoning, enabling them to tackle more intricate tasks. Despite these advancements, ensuring the reliability and accuracy of model outputs, especially for reasoning-intensive tasks, remains a formidable challenge. In response, recent works have focused on self-correction strategies aimed at refining LLMs' decision-making processes through iterative revision. Pioneering approaches such as RARR, Refiner, and CRITIC illustrate the potential of integrating feedback loops and corrective components into model architectures. However, these approaches often yield inconsistent gains across different tasks, raising deeper questions about their correction capability and generalizability.

What is CorrectBench?

CorrectBench is a systematically designed benchmark that quantifies the extent to which various correction methods improve model outputs in reasoning-intensive scenarios. CorrectBench characterizes self-correction along three principal dimensions: Task Scenario, Self-Correction Type, and LLM Type. The evaluation pipeline begins with selecting a specific task scenario and dataset, followed by applying a chosen correction method, and concludes with assessing the model's iterative self-correction process across diverse LLMs.
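In code, this pipeline can be pictured as a loop over (task scenario, dataset, correction method, LLM) configurations. The sketch below only illustrates that flow under assumed interfaces; the callables generate, refine, and is_correct, and the function name evaluate_configuration, are hypothetical and are not part of the benchmark's released API.

```python
from typing import Callable, Iterable, Tuple

def evaluate_configuration(
    examples: Iterable[Tuple[str, str]],      # (prompt, reference answer) pairs from the chosen dataset
    generate: Callable[[str], str],           # initial LLM call: prompt -> first-attempt answer
    refine: Callable[[str, str], str],        # one correction step: (prompt, current answer) -> revised answer
    is_correct: Callable[[str, str], bool],   # task-specific checker (exact match, unit tests, ...)
    max_rounds: int = 3,                      # cap on iterative self-correction rounds
) -> float:
    """Accuracy of one (task scenario, dataset, correction method, LLM) configuration."""
    examples = list(examples)
    num_correct = 0
    for prompt, reference in examples:
        answer = generate(prompt)             # step 1: initial attempt
        for _ in range(max_rounds):           # step 2: iterative self-correction
            revised = refine(prompt, answer)
            if revised == answer:             # converged: the method proposes no further change
                break
            answer = revised
        num_correct += is_correct(answer, reference)  # step 3: assess the final answer
    return num_correct / len(examples)
```

Swapping in a different refine callable corresponds to switching the correction method while keeping the task, dataset, and model fixed.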

Data Overview

Self-Correction Types

S1 (Intrinsic Correction): LLMs internally identify and correct errors without external tools, relying on self-evaluation of their own reasoning steps; a minimal sketch of this loop appears after this list.

S2 (External Correction): Leverages external resources such as knowledge bases and search tools to validate information.

S3 (Fine-tuned Correction): Enhances performance through targeted fine-tuning and requires training with specialized datasets; representative methods include DCoT, SCORE, and SuperCorrect.
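As a concrete illustration of the intrinsic (S1) setting, the sketch below shows a generic critique-then-revise loop in the spirit of methods such as RCI and Self-Refine. The prompt templates, stopping rule, and function names are assumptions made for illustration, not the exact prompts or code of any listed method.

```python
from typing import Callable

def intrinsic_self_correct(llm: Callable[[str], str], question: str, max_rounds: int = 2) -> str:
    """Generic critique-then-revise loop (illustrative prompts, not any method's exact templates)."""
    answer = llm(f"Question: {question}\nAnswer step by step.")
    for _ in range(max_rounds):
        critique = llm(
            f"Question: {question}\nProposed answer: {answer}\n"
            "List any mistakes in the reasoning, or reply 'NO ISSUES' if it is correct."
        )
        if "NO ISSUES" in critique.upper():   # the model judges its own answer to be fine
            return answer
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the issues identified above."
        )
    return answer
```

External (S2) methods follow the same loop but ground the critique in retrieved evidence or tool outputs, while fine-tuned (S3) methods train the model itself to produce the critique and revision.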

Leaderboard

| Type | Method        | HotpotQA (↑)   | CS-QA (↑)      | GPQA (↑)       | GSM8K (↑)      | AQUA (↑)       | MATH (↑)       | HumanEval (↑)  |
|------|---------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| -    | Base          | 80.76          | 79.96          | 18.56          | 86.46          | 61.23          | 75.12          | 72.71          |
| -    | CoT           | 83.29 (+2.53)  | 78.03 (-1.93)  | 16.52 (-2.04)  | 91.96 (+5.50)  | 60.24 (-0.99)  | 72.59 (-2.53)  | 60.10 (-12.61) |
| S1   | RCI           | 79.67 (-1.09)  | 76.29 (-3.67)  | 19.98 (+1.42)  | 87.00 (+0.54)  | 67.12 (+5.89)  | 74.92 (-0.20)  | 67.46 (-5.25)  |
| S1   | CoVe          | 83.04 (+2.28)  | 78.54 (-1.42)  | 37.41 (+18.85) | 92.23 (+5.77)  | 71.12 (+9.89)  | 79.30 (+4.18)  | 76.96 (+4.25)  |
| S1   | Self-Refine   | 85.49 (+4.73)  | 81.06 (+1.10)  | 40.69 (+22.13) | 91.74 (+5.28)  | 69.46 (+8.23)  | 81.77 (+6.65)  | -              |
| S1   | Reflection-v1 | 69.52 (-11.24) | 63.89 (-16.07) | 19.25 (+0.69)  | 67.64 (-18.82) | 48.33 (-12.90) | 65.01 (-10.11) | -              |
| S2   | Reflection-v2 | 87.98 (+7.22)  | 82.21 (+2.25)  | 26.85 (+8.29)  | 89.87 (+3.41)  | 68.23 (+7.00)  | 81.36 (+6.24)  | -              |
| S2   | RARR          | 85.47 (+4.71)  | 80.57 (+0.61)  | 36.82 (+18.26) | 88.92 (+2.46)  | 66.81 (+5.58)  | 82.78 (+7.66)  | 77.35 (+4.64)  |
| S2   | RATT          | 79.59 (-1.17)  | 80.81 (+0.85)  | 25.90 (+7.34)  | 88.08 (+1.62)  | 68.06 (+6.83)  | 80.74 (+5.62)  | 73.44 (+0.73)  |
| S2   | CRITIC        | -              | 81.77 (+1.81)  | -              | 77.46 (-9.00)  | -              | -              | -              |
| -    | Average       | 83.54 (+2.78)  | 80.18 (+0.22)  | 31.28 (+12.72) | 85.04 (-1.42)  | 68.47 (+7.24)  | 80.15 (+5.03)  | 73.80 (+1.09)  |

Values in parentheses denote the change relative to the Base row; "-" marks configurations without a reported result.

Evaluation and Results

Findings indicate substantial accuracy improvements from self-correction, especially on mathematical and complex reasoning tasks. However, combining correction methods can add significant computational overhead: hybrid approaches run roughly 40% slower than the baseline.
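The efficiency side of this trade-off can be quantified by timing each configuration alongside its accuracy. The snippet below is an illustrative measurement harness under assumed interfaces (accuracy_and_latency and relative_overhead are hypothetical names), not the benchmark's actual tooling.

```python
import time
from typing import Callable, Iterable, Tuple

def accuracy_and_latency(
    examples: Iterable[Tuple[str, str]],    # (prompt, reference) pairs
    run_method: Callable[[str], str],       # full pipeline for one prompt, including any correction rounds
    is_correct: Callable[[str, str], bool],
) -> Tuple[float, float]:
    """Return (accuracy, mean seconds per example) for one method on one dataset."""
    examples = list(examples)
    num_correct = 0
    start = time.perf_counter()
    for prompt, reference in examples:
        num_correct += is_correct(run_method(prompt), reference)
    elapsed = time.perf_counter() - start
    return num_correct / len(examples), elapsed / len(examples)

def relative_overhead(method_latency: float, baseline_latency: float) -> float:
    """Fractional slowdown versus the baseline; 0.4 corresponds to '~40% slower'."""
    return method_latency / baseline_latency - 1.0
```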

Evaluation Results

Conclusion and Implications

The benchmark emphasizes self-correction as a valuable tool for enhancing LLM performance, with a clear need to balance reasoning capabilities and efficiency. Future research should optimize this trade-off for practical applications.


BibTeX

@inproceedings{tie2025correctbench,
  title={CorrectBench: A Benchmark of Self-Correction in LLMs},
  author={Guiyao Tie and Zenghui Yuan and Zeli Zhao and Chaoran Hu and Tianhe Gu and Ruihang Zhang and Sizhe Zhang and Junran Wu and Xiaoyue Tu and Ming Jin and Qingsong Wen and Lixing Chen and Pan Zhou and Lichao Sun},
  booktitle={Proceedings of the NeurIPS 2025 Datasets and Benchmarks Track},
  year={2025},
}