
arxiv: 2606.02835 · v1 · pith:TVF44Z37new · submitted 2026-06-01 · 💻 cs.AI

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

Pith reviewed 2026-06-28 14:12 UTC · model grok-4.3

classification 💻 cs.AI
keywords: large reasoning models, overthinking, reasoning traces, prefix evaluation, harmful overthinking, early stopping, reasoning sufficiency

The pith

Stopping reasoning at the first correct prefix raises accuracy by up to 21% in large reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models produce explicit step-by-step traces and are assumed to benefit from longer chains. The paper examines what happens after a model first reaches a correct answer. It introduces a protocol that locates the shortest prefix of the trace from which the correct final answer is obtained, then compares continued generation against stopping there. Many tasks need far less reasoning than models normally output, and halting at the first correct prefix improves accuracy over full traces. The extra steps often introduce logical drift or visual reinterpretation that moves the model away from the correct solution. The same pattern appears in both multimodal and language-only benchmarks, suggesting that models are limited not only by their ability to reason but also by their inability to stop at the right time.

Core claim

Once a model generates the correct answer inside its reasoning trace, further reasoning frequently destabilizes that solution rather than refining it. The minimum reasoning budget required to reach correctness is often short, and forcing the model to stop at that first correct prefix produces accuracy gains of up to 21 percent over standard full-trace generation. Common early-stopping methods cut redundant steps but leave harmful deviations intact; those deviations are driven mainly by logical drift and visual reinterpretation.

What carries the argument

The prefix-level trajectory evaluation protocol that marks the shortest prefix of a reasoning trace whose final answer is correct and uses that point as the minimum sufficient budget.
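
As a rough illustration of the protocol described above, the sketch below scans prefixes of a trace from shortest to longest and returns the first one whose probed answer is correct. It is an independent sketch, not the authors' code (the paper's Algorithm 1 is truncated in this rendering). The names `first_correct_prefix`, `probe`, `is_correct`, and `toy_probe` are hypothetical. In the paper, probing appends a termination prompt to the partial trace and an extractor parses the answer; here a toy function stands in for both.

```python
from typing import Callable, Optional, Sequence


def first_correct_prefix(
    utterances: Sequence[str],
    probe: Callable[[Sequence[str]], str],
    is_correct: Callable[[str], bool],
) -> Optional[int]:
    """Return the smallest k such that probing the first k utterances yields
    a correct answer, or None if no prefix (full trace included) does.

    k = 0 means the answer is already correct with no reasoning at all.
    """
    for k in range(len(utterances) + 1):
        answer = probe(utterances[:k])
        if is_correct(answer):
            return k
    return None


# Toy stand-in (assumption): the "model" answers with the last number it has
# written so far, or "?" if it has written none.
def toy_probe(prefix: Sequence[str]) -> str:
    numbers = [tok for step in prefix for tok in step.split() if tok.isdigit()]
    return numbers[-1] if numbers else "?"


trace = ["Let me read the chart", "The bars sum to 12", "Wait, maybe it is 15"]
k = first_correct_prefix(trace, toy_probe, lambda a: a == "12")
print(k)  # 2
```

In this toy trace the answer is correct after two utterances, while the full trace ends on "15". That is the pattern the paper calls harmful overthinking. A linear scan costs one probe per prefix; whether the authors use a cheaper search is not stated in the text reproduced here.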

If this is right

  • Models are limited as much by when they stop as by how well they reason.
  • Early-stopping heuristics reduce harmless extra steps but leave harmful overthinking untouched.
  • Deviations after correctness arise chiefly from logical drift and visual reinterpretation.
  • The inability to stop at the right time appears in both multimodal and language-only settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Objective functions that explicitly reward termination at first correctness could reduce harmful drift during training.
  • Benchmarks that score only the final output may undervalue models that reach the answer early and then wander.
  • The same stopping problem could surface in other long-horizon generation tasks such as code synthesis or multi-step planning.

Load-bearing premise

Judging correctness on prefixes does not systematically mislabel the true first point at which the model has solved the problem.

What would settle it

An experiment that forces models to terminate exactly at the identified first-correct prefix and measures whether accuracy still exceeds the accuracy obtained from letting the model generate its full original trace.

Figures

Figures reproduced from arXiv: 2606.02835 by Davide Talon, Elisa Ricci, Massimiliano Mancini, Rahaf Aljundi, Simone Caldarella.

Figure 1. Performance averaged on LRMs. Actual Length is the model’s default behavior, No-CoT disables intermediate reasoning, and Instruct Model is the pre-reasoning instruction-tuned model. Finally, Optimal Length stops at the first correct prefix. The gap between Actual Length and Optimal Length shows that models often reason past correctness, making additional reasoning harmful.
Figure 2. Average number of utterances across five multimodal models under Actual Length and …
Figure 3. Distribution of overthinking types across response formats. Bars show the percentage of solved samples exhibiting verbose versus harmful overthinking for multiple-choice (MC) and free-form (FF) settings.
[Adjacent plot, caption not recovered: "Correctness Retention"; x-axis: Reasoning Steps after τ_y; y-axis: P(z_{i+1} = 1 | z_i = 1).]
Figure 5. Representative correctness deviations. Each example shows a trajectory that first reaches the …
Figure 6. Robustness of the difficulty analysis across controlled sources of variation. The left …
Figure 7. Cross-model Spearman Correlation in estimated …
Figure 8. Optimal Length scaling compared with standard test-time (Actual Length) scaling and Pass@K. Increasing test-time compute improves performance, but remains below Optimal Length, which stops each trajectory at the first correct prefix. The gap shows that models often already contain the correct answer before termination, but fail to stop before later reasoning deviates from correctness.
Figure 10. Token-level statistics for utterance-based reasoning budgets. Left: distribution of the …
Figure 11. Average wasted budget in number of utterances per model per benchmark. DualMind …
Figure 12. Actual vs. optimal reasoning length for language-only LRMs across all models and benchmarks. Actual traces are substantially longer than the first-correct prefixes, showing that language-only models also reason far beyond the point at which the correct answer first becomes recoverable.
[Adjacent plot, caption not recovered: y-axis: Mean % of solved examples; groups MC and FF, each split into Verbose and Harmful; values shown: .73, .27, .60, .40.]
Figure 14. Termination prompts used for prefix-level probing. Each prompt is appended to a partial …
Figure 15. Prompt used by the answer extractor A. The variable model_trace denotes the raw generation produced by the evaluated model, either at full length or after prefix-level probing. The extractor returns only the concise final answer used for benchmark verification.
Algorithm 1. PyTorch-style code for κ̂(x; F)
# x = input problem
# F = reasoning model
# A = parser mapping output to a prediction
# y = ground-tru…
Figure 16. Failure-analysis judge prompt. The judge compares the final incorrect trace against the …
Original abstract

Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: "Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?" To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at https://simonecaldarella.github.io/thinking-past-the-answer.
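
The distinction the abstract draws between verbose and harmful overthinking can be sketched from a per-prefix correctness vector. The code below is a minimal illustration under stated assumptions, not the paper's implementation: `z[k]` is assumed to record whether the answer probed after `k` utterances is correct, the last entry is assumed to be the full default-length trace, and the labels `"minimal"` and `"never_correct"` are hypothetical additions for the remaining cases. The `retention` helper estimates the quantity P(z_{i+1} = 1 | z_i = 1) that appears on an axis of the paper's correctness-retention plot; pooling over all steps is a choice made for this sketch.

```python
from typing import Sequence


def classify_trajectory(z: Sequence[bool]) -> str:
    """z[k] is True when the answer probed after k utterances is correct;
    z[-1] is the correctness of the full, default-length trace."""
    if not z:
        raise ValueError("need at least one probed prefix")
    if not any(z):
        return "never_correct"
    first = list(z).index(True)
    if not z[-1]:
        return "harmful"   # correct at some prefix, wrong at the end
    if first < len(z) - 1:
        return "verbose"   # correct early and still correct at the end
    return "minimal"       # first correct only at the full trace


def retention(trajectories: Sequence[Sequence[bool]]) -> float:
    """Empirical P(z[i+1] = 1 | z[i] = 1), pooled over all steps."""
    kept = total = 0
    for z in trajectories:
        for current, following in zip(z, z[1:]):
            if current:
                total += 1
                kept += int(following)
    return kept / total if total else float("nan")


runs = [
    [False, True, True, True],     # verbose
    [False, True, False, False],   # harmful
    [False, False, False, True],   # minimal
    [False, False],                # never_correct
]
labels = [classify_trajectory(z) for z in runs]
print(labels)
print(retention(runs))
```

Note that a trajectory that loses and later regains the correct answer is labeled verbose here because its final answer is correct; the paper's precise treatment of that case may differ.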

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a prefix-level trajectory evaluation protocol to study harmful overthinking in Large Reasoning Models (LRMs). It claims that many multimodal reasoning tasks require surprisingly little reasoning to first reach correctness, that stopping at the first correct prefix yields accuracy gains of up to 21% over full trajectories, that common early-stopping methods reduce verbose but not harmful overthinking, and that the phenomenon generalizes to language-only benchmarks. Failure modes are attributed primarily to logical drift and visual reinterpretation.

Significance. If the protocol is robust, the work identifies a previously under-examined reliability failure mode in test-time scaling: models can destabilize correct solutions through continued reasoning. The empirical measurement on public benchmarks, the distinction between verbose and harmful overthinking, the demonstration that standard efficiency interventions do not address the latter, and the public code release are concrete contributions that could inform both evaluation protocols and training objectives for reasoning models.

major comments (3)
  1. [Methods / prefix-level trajectory evaluation protocol] The paper provides no explicit description of the answer extractor applied to incomplete prefixes (regex for \boxed{}, LLM judge, final-token matching, or other). This detail is load-bearing for the headline 21% accuracy claim, because any systematic misclassification of early prefixes (especially under visual reinterpretation in multimodal traces) directly affects the measured gap between first-correct and full-trajectory accuracy.
  2. [Results on multimodal benchmarks] The abstract states that stopping at the first correct prefix improves accuracy “up to 21%,” yet no table or section reports per-benchmark deltas, confidence intervals, or controls for multiple comparisons. Without these, it is impossible to assess whether the reported gain survives statistical scrutiny or benchmark-specific artifacts.
  3. [Failure analysis] The claim that “correctness deviations are mainly driven by logical drift and visual reinterpretation” is presented without a quantitative breakdown (e.g., percentage of cases per failure mode) or a table of representative examples. This weakens the causal interpretation of harmful overthinking.
minor comments (2)
  1. [Abstract] The code link is given only in the abstract; a formal Data Availability or Code Availability statement with a persistent DOI or GitHub release tag would improve reproducibility.
  2. [Introduction / Methods] Notation for “reasoning sufficiency” and “first correct prefix” should be defined once in a dedicated subsection rather than introduced piecemeal across the abstract and results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The comments highlight important areas for improving methodological transparency, statistical reporting, and the rigor of our failure analysis. We address each point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Methods / prefix-level trajectory evaluation protocol] the paper provides no explicit description of the answer extractor applied to incomplete prefixes (regex for \boxed{}, LLM judge, final-token matching, or other). This detail is load-bearing for the headline 21% accuracy claim, because any systematic misclassification of early prefixes (especially under visual reinterpretation in multimodal traces) directly affects the measured gap between first-correct and full-trajectory accuracy.

    Authors: We agree that an explicit description of the answer extractor is necessary for reproducibility and to substantiate the accuracy gains. The current manuscript describes the overall prefix-level protocol but does not detail the extractor implementation. In the revised version we will add a dedicated subsection in Methods that specifies: (1) primary use of regex matching on \boxed{} and final-answer formats, (2) fallback to an LLM judge with a fixed prompt and temperature for ambiguous multimodal cases, and (3) human verification on a 200-sample subset to quantify extractor error rates. This addition will directly address concerns about misclassification under visual reinterpretation. revision: yes

  2. Referee: [Results on multimodal benchmarks] the abstract states that stopping at the first correct prefix improves accuracy “up to 21%,” yet no table or section reports per-benchmark deltas, confidence intervals, or controls for multiple comparisons. Without these, it is impossible to assess whether the reported gain survives statistical scrutiny or benchmark-specific artifacts.

    Authors: We acknowledge that the abstract reports only the maximum observed gain without per-benchmark detail or uncertainty estimates. The full manuscript contains aggregate results but lacks the requested breakdown. In revision we will add a new results table (and corresponding appendix) that reports, for each multimodal benchmark: (i) accuracy of full trajectories, (ii) accuracy when stopping at the first correct prefix, (iii) absolute and relative deltas, and (iv) 95% bootstrap confidence intervals. We will also state that the primary comparisons were pre-specified and therefore no multiple-comparison correction was applied. These changes will allow readers to evaluate statistical robustness directly. revision: yes

  3. Referee: [Failure analysis] the claim that “correctness deviations are mainly driven by logical drift and visual reinterpretation” is presented without a quantitative breakdown (e.g., percentage of cases per failure mode) or a table of representative examples. This weakens the causal interpretation of harmful overthinking.

    Authors: We agree that the current failure analysis is primarily qualitative. While the manuscript identifies the two dominant modes through manual inspection, it does not provide counts or examples. In the revised manuscript we will add: (1) a quantitative breakdown table based on annotation of 150 randomly sampled failure cases, reporting the percentage attributed to logical drift, visual reinterpretation, and other categories, and (2) a supplementary table with 2–3 representative trace excerpts per mode. These additions will ground the causal claims with concrete evidence. revision: yes
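
The 95% bootstrap confidence intervals proposed in the second response could be computed along the following lines. This is a hedged sketch using a paired percentile bootstrap on per-example correctness; the function name `bootstrap_delta_ci`, the resample count, and the synthetic data are all assumptions made for illustration and do not come from the paper.

```python
import random
from typing import Sequence, Tuple


def bootstrap_delta_ci(
    full_correct: Sequence[bool],
    prefix_correct: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Paired percentile bootstrap for accuracy(prefix) - accuracy(full).

    Returns (point_estimate, lower, upper).
    """
    if len(full_correct) != len(prefix_correct) or not full_correct:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(full_correct)
    # Per-example difference keeps the pairing between the two conditions.
    diffs = [int(p) - int(f) for f, p in zip(full_correct, prefix_correct)]
    point = sum(diffs) / n
    rng = random.Random(seed)
    stats = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * (n_resamples - 1))]
    upper = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return point, lower, upper


# Synthetic example (made-up numbers): 100 problems, 60 solved by the full
# trace, 75 solved when stopping at the first correct prefix.
full = [True] * 60 + [False] * 40
prefix = [True] * 75 + [False] * 25
point, lower, upper = bootstrap_delta_ci(full, prefix)
print(point, lower, upper)
```

Resampling examples in pairs, rather than resampling each condition separately, respects the fact that both accuracies are measured on the same problems. If the full trace counts as one of the probed prefixes, the first-correct-prefix condition can never score below the full trace on any example, so the interval describes the size of the gap rather than its sign.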

Circularity Check

0 steps flagged

No circularity: empirical protocol yields measured accuracy deltas on public benchmarks

full rationale

The paper introduces a prefix-level trajectory evaluation protocol as an operational definition for identifying the first correct answer in reasoning traces, then reports empirical accuracy improvements (up to 21%) when stopping at that point versus full trajectories. These are direct measurements on multimodal and language benchmarks, not quantities derived from self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation chain reduces the headline result to its inputs by construction; the protocol is a measurement tool whose validity is separate from circularity concerns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that benchmark answers can be unambiguously matched to model prefixes and that the selected multimodal and language benchmarks are representative of reasoning tasks where overthinking occurs.

axioms (1)
  • domain assumption: The definition of the first correct prefix can be determined unambiguously from model outputs on the chosen benchmarks.
    The entire evaluation protocol depends on reliably identifying when the model first produces the correct answer.


