Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

Davide Talon; Elisa Ricci; Massimiliano Mancini; Rahaf Aljundi; Simone Caldarella

arxiv: 2606.02835 · v1 · pith:TVF44Z37new · submitted 2026-06-01 · 💻 cs.AI

Pith reviewed 2026-06-28 14:12 UTC · model grok-4.3

classification 💻 cs.AI

keywords large reasoning models · overthinking · reasoning traces · prefix evaluation · harmful overthinking · early stopping · reasoning sufficiency

The pith

Stopping reasoning at the first correct prefix raises accuracy by up to 21% in large reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models produce explicit step-by-step traces and are assumed to benefit from longer chains, yet the paper examines what happens after the model first reaches a correct answer. It introduces a protocol that locates the shortest prefix of the trace from which the correct final answer is obtained, then compares continued generation against stopping there. Many tasks need far less reasoning than models normally output, and halting at the first correct prefix improves accuracy over full traces. The extra steps often introduce logical drift or visual reinterpretation that moves the model away from the correct solution. The same pattern appears in both multimodal and language-only benchmarks, indicating that models are limited not only by their ability to reason but also by their inability to stop at the right time.

Core claim

Once a model generates the correct answer inside its reasoning trace, further reasoning frequently destabilizes that solution rather than refining it. The minimum reasoning budget required to reach correctness is often short, and forcing the model to stop at that first correct prefix produces accuracy gains of up to 21 percent over standard full-trace generation. Common early-stopping methods cut redundant steps but leave harmful deviations intact; those deviations are driven mainly by logical drift and visual reinterpretation.

What carries the argument

The prefix-level trajectory evaluation protocol that marks the shortest prefix of a reasoning trace whose final answer is correct and uses that point as the minimum sufficient budget.
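
The search for the first correct prefix can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the trace has already been split into utterances and that a caller-supplied `probe` function returns the answer the model commits to when forced to stop after a given prefix. The names `first_correct_prefix`, `probe`, and `toy_probe` are hypothetical, and treating the empty prefix as budget 0 is also an assumption.

```python
from typing import Callable, Optional, Sequence


def first_correct_prefix(
    utterances: Sequence[str],
    probe: Callable[[Sequence[str]], str],
    gold: str,
) -> Optional[int]:
    """Return the length of the shortest prefix whose probed answer equals gold.

    Length 0 is the empty prefix (no reasoning). Returns None if no prefix,
    including the full trace, yields the gold answer.
    """
    for k in range(len(utterances) + 1):
        if probe(utterances[:k]).strip() == gold.strip():
            return k
    return None


# Toy probe (hypothetical): the "answer" is the last token of the last utterance.
def toy_probe(prefix: Sequence[str]) -> str:
    return prefix[-1].split()[-1] if prefix else ""


trace = ["area is 6", "so the answer is 12", "wait, maybe 14"]
budget = first_correct_prefix(trace, toy_probe, gold="12")
```

In a real setting `probe` would append a termination prompt to the partial trace, query the model, and parse the result, so each prefix costs one extra model call.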

If this is right

Models are limited as much by when they stop as by how well they reason.
Early-stopping heuristics reduce harmless extra steps but leave harmful overthinking untouched.
Deviations after correctness arise chiefly from logical drift and visual reinterpretation.
The inability to stop at the right time appears in both multimodal and language-only settings.
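
The distinction between harmless and harmful extra reasoning can be made concrete with a small labeling function. This is a sketch under stated assumptions: each trajectory is represented as a list of 0/1 correctness flags, one per probed prefix, with the last flag belonging to the full trace. The function name and the label strings are illustrative, and the paper's exact criteria may differ.

```python
from typing import Sequence


def classify_overthinking(z: Sequence[int]) -> str:
    """Label one trajectory from its per-prefix correctness flags.

    z[i] is 1 if the answer probed after prefix i is correct, else 0;
    the last entry corresponds to the full trace.
    """
    if 1 not in z:
        return "unsolved"            # never correct at any prefix
    first = list(z).index(1)
    if first == len(z) - 1:
        return "no_overthinking"     # first correct only at the full trace
    # Reasoning continued past the first correct prefix.
    return "verbose" if z[-1] == 1 else "harmful"
```

Under this reading, an early-stopping heuristic that trims a `verbose` trajectory saves tokens without changing the outcome, while a `harmful` trajectory is only rescued if the heuristic happens to stop before the answer flips.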

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Objective functions that explicitly reward termination at first correctness could reduce harmful drift during training.
Benchmarks that score only the final output may undervalue models that reach the answer early and then wander.
The same stopping problem could surface in other long-horizon generation tasks such as code synthesis or multi-step planning.

Load-bearing premise

Judging correctness on prefixes does not systematically mislabel the true first point at which the model has solved the problem.

What would settle it

An experiment that forces models to terminate exactly at the identified first-correct prefix and measures whether accuracy still exceeds the accuracy obtained from letting the model generate its full original trace.
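
The comparison described above can be sketched in a few lines, assuming per-prefix correctness flags are already available for every example. The function name `actual_vs_optimal` and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def actual_vs_optimal(trajectories: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (actual_accuracy, optimal_accuracy) over a set of trajectories.

    Each trajectory is a list of 0/1 correctness flags, one per prefix,
    with the final flag belonging to the full trace.
    """
    n = len(trajectories)
    # Actual Length: the model is scored on the answer at termination.
    actual = sum(z[-1] for z in trajectories) / n
    # Optimal Length: the model is scored as correct if any prefix is correct.
    optimal = sum(1 for z in trajectories if any(z)) / n
    return actual, optimal


flags = [[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0]]
actual_acc, optimal_acc = actual_vs_optimal(flags)
```

Under this definition the optimal-length accuracy can never fall below the actual-length accuracy, because the stopping point is chosen with access to the gold label. The informative quantity is therefore the size of the gap, and whether it survives changes to the prefix judge.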

Figures

Figures reproduced from arXiv: 2606.02835 by Davide Talon, Elisa Ricci, Massimiliano Mancini, Rahaf Aljundi, Simone Caldarella.

**Figure 1.** Performance averaged on LRMs. Actual Length is the model’s default behavior, No-CoT disables intermediate reasoning, and Instruct Model is the pre-reasoning instruction-tuned model. Finally, Optimal Length stops at the first correct prefix. The gap between Actual Length and Optimal Length shows that models often reason past correctness, making additional reasoning harmful.

**Figure 2.** Average number of utterances across five multimodal models under Actual Length and …

**Figure 3.** Distribution of overthinking types across response formats. Bars show the percentage of solved samples exhibiting verbose versus harmful overthinking for multiple-choice (MC) and free-form (FF) settings. [Plot residue from an adjacent panel titled "Correctness Retention": x-axis "Reasoning Steps after τy", y-axis P(z_{i+1} = 1 | z_i = 1).]

**Figure 5.** Representative correctness deviations. Each example shows a trajectory that first reaches the …

**Figure 6.** Robustness of the difficulty analysis across controlled sources of variation. The left …

**Figure 7.** Cross-model Spearman Correlation in estimated …

**Figure 8.** Optimal Length scaling compared with standard test-time (Actual Length) scaling and Pass@K. Increasing test-time compute improves performance, but remains below Optimal Length, which stops each trajectory at the first correct prefix. The gap shows that models often already contain the correct answer before termination, but fail to stop before later reasoning deviates from correctness.

**Figure 10.** Token-level statistics for utterance-based reasoning budgets. Left: distribution of the …

**Figure 11.** Average wasted budget in number of utterances per model per benchmark. DualMind …

**Figure 12.** Actual vs. optimal reasoning length for language-only LRMs across all models and benchmarks. Actual traces are substantially longer than the first-correct prefixes, showing that language-only models also reason far beyond the point at which the correct answer first becomes recoverable. [Plot residue from an adjacent panel: mean % of solved examples for MC and FF, split into Verbose and Harmful.]

**Figure 14.** Termination prompts used for prefix-level probing. Each prompt is appended to a partial …

**Figure 15.** Prompt used by the answer extractor A. The variable model_trace denotes the raw generation produced by the evaluated model, either at full length or after prefix-level probing. The extractor returns only the concise final answer used for benchmark verification. [Extraction residue from Algorithm 1, PyTorch-style code for κ̂(x; F), where x = input problem, F = reasoning model, A = parser mapping output to a prediction, y = ground-tru…]

**Figure 16.** Failure-analysis judge prompt. The judge compares the final incorrect trace against the …
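
The quantity P(z_{i+1} = 1 | z_i = 1) that appears in the figure residue above can be estimated from per-prefix correctness flags. The sketch below is one plausible reading, indexing the estimate by the number of steps since the first correct prefix. The function name `retention_curve` and the toy data are hypothetical, and the paper may bin or aggregate differently.

```python
from collections import defaultdict
from typing import Dict, Sequence


def retention_curve(trajectories: Sequence[Sequence[int]]) -> Dict[int, float]:
    """Estimate P(z[i+1] = 1 | z[i] = 1), indexed by steps after first correctness."""
    stay = defaultdict(int)   # transitions correct -> correct at offset d
    total = defaultdict(int)  # all transitions leaving a correct prefix at offset d
    for z in trajectories:
        if 1 not in z:
            continue          # never correct, so no retention to measure
        tau = list(z).index(1)
        for i in range(tau, len(z) - 1):
            if z[i] == 1:
                d = i - tau
                total[d] += 1
                stay[d] += z[i + 1]
    return {d: stay[d] / total[d] for d in sorted(total)}


curve = retention_curve([[0, 1, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]])
```

A curve that stays near 1 would indicate that correct answers are stable under further reasoning, while a decaying curve would correspond to the harmful overthinking the paper describes.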

Original abstract

Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: "Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?" To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at https://simonecaldarella.github.io/thinking-past-the-answer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Stopping early at the first correct prefix lifts accuracy up to 21% because models often drift after they already have the answer.

The main point here is that large reasoning models reach the right answer early on many problems but then keep going and end up wrong. Stopping at the first correct prefix recovers up to 21% accuracy compared with letting them run to the end. The paper separates this harmful overthinking from simple verbose overthinking and shows that standard early-stopping tricks cut the verbose kind but not the harmful kind.

What is new is the prefix-level protocol that tracks the minimum reasoning budget needed for first correctness and then measures what happens on the rest of the trace. They apply it first to multimodal benchmarks, find that many tasks need surprisingly little reasoning, and then check that the pattern holds on text-only ones. The failure cases they break down—logical drift and visual reinterpretation—line up with the numbers they report.

The 21% figure is the headline result, but it rests on how a prefix is labeled as correct. The abstract does not describe the answer extractor applied to incomplete traces, and that choice matters a great deal for multimodal data, where a partial trace may not yet contain a fully formatted answer. If the extractor is too strict or too loose, the measured gap between early-stop and full traces could shift. The possibility of bias in prefix-level judgment is therefore worth checking against the actual code and methods section.

The work is empirical and the code is released, so the measurements can be inspected. It is aimed at people who build or deploy reasoning models and want to understand when extra test-time compute stops helping. The central claim holds up on its own terms once the protocol details are verified, so it deserves a serious referee rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces a prefix-level trajectory evaluation protocol to study harmful overthinking in Large Reasoning Models (LRMs). It claims that many multimodal reasoning tasks require surprisingly little reasoning to first reach correctness, that stopping at the first correct prefix yields accuracy gains of up to 21% over full trajectories, that common early-stopping methods reduce verbose but not harmful overthinking, and that the phenomenon generalizes to language-only benchmarks. Failure modes are attributed primarily to logical drift and visual reinterpretation.

Significance. If the protocol is robust, the work identifies a previously under-examined reliability failure mode in test-time scaling: models can destabilize correct solutions through continued reasoning. The empirical measurement on public benchmarks, the distinction between verbose and harmful overthinking, the demonstration that standard efficiency interventions do not address the latter, and the public code release are concrete contributions that could inform both evaluation protocols and training objectives for reasoning models.

major comments (3)

[Methods / prefix-level trajectory evaluation protocol] The paper provides no explicit description of the answer extractor applied to incomplete prefixes (regex for \boxed{}, LLM judge, final-token matching, or other). This detail is load-bearing for the headline 21% accuracy claim, because any systematic misclassification of early prefixes (especially under visual reinterpretation in multimodal traces) directly affects the measured gap between first-correct and full-trajectory accuracy.
[Results on multimodal benchmarks] The abstract states that stopping at the first correct prefix improves accuracy “up to 21%,” yet no table or section reports per-benchmark deltas, confidence intervals, or controls for multiple comparisons. Without these, it is impossible to assess whether the reported gain survives statistical scrutiny or benchmark-specific artifacts.
[Failure analysis] The claim that “correctness deviations are mainly driven by logical drift and visual reinterpretation” is presented without a quantitative breakdown (e.g., percentage of cases per failure mode) or a table of representative examples. This weakens the causal interpretation of harmful overthinking.
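
To illustrate why the extractor choice is load-bearing, here is a minimal sketch of one of the options the referee lists, a regex for \boxed{}. It is not the paper's extractor; the name `extract_boxed` is hypothetical, and the pattern deliberately ignores nested braces.

```python
import re
from typing import Optional

# Matches \boxed{...} with no nested braces; nested content is out of scope here.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text: str) -> Optional[str]:
    r"""Return the content of the last \boxed{...} in text, or None if absent."""
    matches = _BOXED.findall(text)
    return matches[-1].strip() if matches else None


full_trace = r"so \boxed{12} ... wait, on reflection \boxed{ 14 }"
partial_prefix = "the area is 6, so doubling gives 12"

final_answer = extract_boxed(full_trace)
prefix_answer = extract_boxed(partial_prefix)
```

A strict extractor of this kind returns nothing for the partial prefix even though the value 12 is already stated, which would push the estimated first-correct point later. A looser extractor risks the opposite error, crediting a prefix for a number the model has not yet committed to.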

minor comments (2)

[Abstract] The code link is given only in the abstract; a formal Data Availability or Code Availability statement with a persistent DOI or GitHub release tag would improve reproducibility.
[Introduction / Methods] Notation for “reasoning sufficiency” and “first correct prefix” should be defined once in a dedicated subsection rather than introduced piecemeal across the abstract and results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The comments highlight important areas for improving methodological transparency, statistical reporting, and the rigor of our failure analysis. We address each point below and will incorporate revisions to strengthen the manuscript.

Referee: [Methods / prefix-level trajectory evaluation protocol] the paper provides no explicit description of the answer extractor applied to incomplete prefixes (regex for \boxed{}, LLM judge, final-token matching, or other). This detail is load-bearing for the headline 21% accuracy claim, because any systematic misclassification of early prefixes (especially under visual reinterpretation in multimodal traces) directly affects the measured gap between first-correct and full-trajectory accuracy.

Authors: We agree that an explicit description of the answer extractor is necessary for reproducibility and to substantiate the accuracy gains. The current manuscript describes the overall prefix-level protocol but does not detail the extractor implementation. In the revised version we will add a dedicated subsection in Methods that specifies: (1) primary use of regex matching on \boxed{} and final-answer formats, (2) fallback to an LLM judge with a fixed prompt and temperature for ambiguous multimodal cases, and (3) human verification on a 200-sample subset to quantify extractor error rates. This addition will directly address concerns about misclassification under visual reinterpretation.
revision: yes
Referee: [Results on multimodal benchmarks] the abstract states that stopping at the first correct prefix improves accuracy “up to 21%,” yet no table or section reports per-benchmark deltas, confidence intervals, or controls for multiple comparisons. Without these, it is impossible to assess whether the reported gain survives statistical scrutiny or benchmark-specific artifacts.

Authors: We acknowledge that the abstract reports only the maximum observed gain without per-benchmark detail or uncertainty estimates. The full manuscript contains aggregate results but lacks the requested breakdown. In revision we will add a new results table (and corresponding appendix) that reports, for each multimodal benchmark: (i) accuracy of full trajectories, (ii) accuracy when stopping at the first correct prefix, (iii) absolute and relative deltas, and (iv) 95% bootstrap confidence intervals. We will also state that the primary comparisons were pre-specified and therefore no multiple-comparison correction was applied. These changes will allow readers to evaluate statistical robustness directly.
revision: yes
Referee: [Failure analysis] the claim that “correctness deviations are mainly driven by logical drift and visual reinterpretation” is presented without a quantitative breakdown (e.g., percentage of cases per failure mode) or a table of representative examples. This weakens the causal interpretation of harmful overthinking.

Authors: We agree that the current failure analysis is primarily qualitative. While the manuscript identifies the two dominant modes through manual inspection, it does not provide counts or examples. In the revised manuscript we will add: (1) a quantitative breakdown table based on annotation of 150 randomly sampled failure cases, reporting the percentage attributed to logical drift, visual reinterpretation, and other categories, and (2) a supplementary table with 2–3 representative trace excerpts per mode. These additions will ground the causal claims with concrete evidence.
revision: yes
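
The 95% bootstrap confidence intervals mentioned in the exchange above could be computed along the following lines. This is a sketch of a paired percentile bootstrap using only the standard library. The function name, the resample count, and the toy data are assumptions, and the authors may use a different bootstrap variant.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_delta_ci(
    full: Sequence[int],
    early: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Paired percentile bootstrap for mean(early) - mean(full).

    full[i] and early[i] are 0/1 correctness flags for the same example.
    Returns (point_estimate, lower, upper).
    """
    assert len(full) == len(early) and len(full) > 0
    n = len(full)
    # Resample per-example differences so the pairing is preserved.
    diffs = [e - f for e, f in zip(early, full)]
    rng = random.Random(seed)
    stats: List[float] = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Toy flags for eight examples (hypothetical data).
point, lower, upper = bootstrap_delta_ci(
    full=[0, 0, 1, 1, 0, 1, 0, 0],
    early=[1, 0, 1, 1, 1, 1, 0, 1],
)
```

Pairing matters here because both conditions are scored on the same examples; resampling the two accuracy columns independently would typically give wider intervals than the data support.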

Circularity Check

0 steps flagged

No circularity: empirical protocol yields measured accuracy deltas on public benchmarks

The paper introduces a prefix-level trajectory evaluation protocol as an operational definition for identifying the first correct answer in reasoning traces, then reports empirical accuracy improvements (up to 21%) when stopping at that point versus full trajectories. These are direct measurements on multimodal and language benchmarks, not quantities derived from self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation chain reduces the headline result to its inputs by construction; the protocol is a measurement tool whose validity is separate from circularity concerns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that benchmark answers can be unambiguously matched to model prefixes and that the selected multimodal and language benchmarks are representative of reasoning tasks where overthinking occurs.

axioms (1)

Domain assumption: the first correct prefix can be determined unambiguously from model outputs on the chosen benchmarks.
The entire evaluation protocol depends on reliably identifying when the model first produces the correct answer.

pith-pipeline@v0.9.1-grok · 5804 in / 1250 out tokens · 26512 ms · 2026-06-28T14:12:28.782950+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 8 linked inside Pith

[1]

Intern-s1: A scientific multimodal foundation model.arXiv:2508.15763, 2025

Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, et al. Intern-s1: A scientific multimodal foundation model.arXiv:2508.15763, 2025

arXiv 2025
[2]

Are we on the right way for evaluating large vision-language models?NeurIPS, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?NeurIPS, 2024

2024
[3]

Evaluating large language models trained on code.arXiv:2107.03374, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv:2107.03374, 2021

Pith/arXiv arXiv 2021
[4]

Do not think that much for 2+ 3=? on the overthinking of o1-like llms.ICML, 2025

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms.ICML, 2025

2025
[5]

Training verifiers to solve math word problems.arXiv:2110.14168, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021
[6]

The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks.arXiv:2502.08235, 2025

Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks.arXiv:2502.08235, 2025

arXiv 2025
[7]

S-grpo: Early exit via reinforcement learning in reasoning models.NeurIPS, 2025

Muzhi Dai, Chenxu Yang, and Qingyi Si. S-grpo: Early exit via reinforcement learning in reasoning models.NeurIPS, 2025

2025
[8]

Efficiently scaling llm reasoning with certaindex.NeurIPS, 2025

Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, et al. Efficiently scaling llm reasoning with certaindex.NeurIPS, 2025

2025
[9]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 2025

2025
[10]

Measuring mathematical problem solving with the math dataset.NeurIPS, 2021

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.NeurIPS, 2021

2021
[11]

Openai o1 system card.arXiv:2412.16720, 2024

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv:2412.16720, 2024

Pith/arXiv arXiv 2024
[12]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InECCV, 2016

2016
[13]

Chain of thought compression: A theoritical analysis.arXiv preprint arXiv:2601.21576, 2026

Juncai Li, Ru Li, Yuxiang Zhou, Boxiang Ma, and Jeff Z Pan. Chain of thought compression: A theoritical analysis.arXiv preprint arXiv:2601.21576, 2026

Pith/arXiv arXiv 2026
[14]

To think or not to think: A study of thinking in rule-based visual reinforcement fine-tuning

Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Haoquan Zhang, Wang Bill Zhu, and Kaipeng Zhang. To think or not to think: A study of thinking in rule-based visual reinforcement fine-tuning. InNeurIPS, 2025

2025
[15]

Learning to think fast and slow for visual language models.arXiv:2511.16670, 2025

Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, and Kaiyang Zhou. Learning to think fast and slow for visual language models.arXiv:2511.16670, 2025

arXiv 2025
[16]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.NeurIPS, 2023

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.NeurIPS, 2023

2023
[17]

Qfft, question-free fine-tuning for adaptive reasoning.NeurIPS, 2025

Wanlong Liu, Junxiao Xu, Fei Yu, Yukang Lin, Ke Ji, Wenyu Chen, Yan Xu, Yasheng Wang, Lifeng Shang, and Benyou Wang. Qfft, question-free fine-tuning for adaptive reasoning.NeurIPS, 2025. 10

2025
[18]

Efficient inference for large reasoning models: A survey.arXiv:2503.23077, 2025

Yue Liu, Jiaying Wu, Yufei He, Ruihan Gong, Jun Xia, Liang Li, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, et al. Efficient inference for large reasoning models: A survey.arXiv:2503.23077, 2025

arXiv 2025
[19]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.ICLR, 2023

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.ICLR, 2023

2023
[20]

Reasoning models can be effective without thinking.arXiv:2504.09858, 2025

Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking.arXiv:2504.09858, 2025

arXiv 2025
[21]

Comparison of the predicted and observed secondary structure of t4 phage lysozyme

Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 1975

1975
[23]

Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv:2503.07365, 2025

Pith/arXiv arXiv 2025
[24]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InEMNLP, 2025

2025
[25]

Gpqa: A graduate-level google-proof q&a benchmark.COLM, 2023

David Rein, Betty Hou, Amos Stock, William Liu, Ayan Mandlekar, Arian Ghodsi, Dara Bahri, Fan Zhou, Akshay Mehra, Eunice Yiu, et al. Gpqa: A graduate-level google-proof q&a benchmark.COLM, 2023

2023
[26]

Dast: Difficulty-adaptive slow-thinking for large reasoning models

Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, and Shiguo Lian. Dast: Difficulty-adaptive slow-thinking for large reasoning models. In EMNLP, pages 2322–2331, 2025

2025
[27]

The proof and measurement of association between two things

Charles Spearman. The proof and measurement of association between two things. 1961

1961
[28]

Stop overthinking: A survey on efficient reasoning for large language models.arXiv:2503.16419, 2025

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, et al. Stop overthinking: A survey on efficient reasoning for large language models.arXiv:2503.16419, 2025

Pith/arXiv arXiv 2025
[29]

Confidence improves self-consistency in llms

Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. InFindings-ACL 2025, 2025

2025
[30]

Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.NeurIPS, 2025

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.NeurIPS, 2025

2025
[31]

Measuring multimodal mathematical reasoning with math-vision dataset.NeurIPS, 2024

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.NeurIPS, 2024

2024
[32]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv:2409.12191, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024
[33]

Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning

Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, and Kan Li. Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning. In Findings-NAACL, 2025

2025
[35]

Is it thinking or cheating? detecting implicit reward hacking by measuring reasoning effort

Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, and He He. Is it thinking or cheating? detecting implicit reward hacking by measuring reasoning effort. InICLR, 2026

2026
[36]

Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. InNeurIPS, 2025

2025
[37]

Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.NeurIPS, 2025

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.NeurIPS, 2025. 11

2025
[38]

When more is less: Understanding chain-of-thought length in llms.ICLR, 2026

Yuyang Wu, Yifei Wang, Ziyu Ye, Tianqi Du, Stefanie Jegelka, and Yisen Wang. When more is less: Understanding chain-of-thought length in llms.ICLR, 2026

2026
[39]

Fast-slow thinking GRPO for large vision-language model reasoning

Wenyi Xiao and Leilei Gan. Fast-slow thinking GRPO for large vision-language model reasoning. In NeurIPS, 2025

2025
[40]

Qwen3 technical report.arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[41]

Dynamic early exit in reasoning models.ICLR, 2026

Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning models.ICLR, 2026

2026
[42]

Demystifying long chain-of- thought reasoning

Shiming Yang, Yuxuan Tong, Xinyao Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of- thought reasoning. InICML, 2025

2025
[43]

Reasoning models know when they’re right: Probing hidden states for self-verification.COLM, 2025

Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they’re right: Probing hidden states for self-verification.COLM, 2025

2025
[44]

Adaptthink: Reasoning models can learn when to think

Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. Adaptthink: Reasoning models can learn when to think. InEMNLP, 2025

2025
[45] Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. In ICCV, 2025
[46] Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. ICCV, 2025
[47] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In ECCV, 2024
[48] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Guoyin Wang, et al. Instruction tuning for large language models: A survey. ACM, 2026
[49] Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025. https://huggingface.co/datasets/math-ai/aime25, 2025
[50] Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, et al. Automated generation of challenging multiple-choice questions for vision language model evaluation. In CVPR, 2025
[51] Compare only what changed after the last-correct prefix
[52] Identify the main failure mode introduced by the final/full trace
[53] If an image is provided, decide whether the added suffix hallucinates or misreads visual evidence
[54] Choose the best available category even when the drift is ambiguous
[55] Ignore the standard forced final-answer probe suffix. Allowed categories: {categories} Return only valid JSON: { "category": "one_allowed_category", "secondary_categories": ["zero_or_more_allowed_categories"], "severity": 0_to_100_integer, "went_wrong": "short explanation", "evidence": "short quote or paraphrase from the added/final trace", "example": "mi...
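The prompt fragment above asks the judge to return only valid JSON with a fixed set of fields. A minimal sketch of how such a reply could be checked before use is given below. It is not the authors' code: the function name `parse_judge_reply` and the category labels are hypothetical, and only the fields visible in the truncated prompt (`category`, `secondary_categories`, `severity`, `went_wrong`, `evidence`) are validated. Anything after the cut-off `"example"` field is ignored.

```python
import json
from typing import Iterable

# Only the fields visible in the (truncated) prompt are checked here.
CHECKED_TEXT_FIELDS = ("went_wrong", "evidence")


def parse_judge_reply(raw: str, allowed_categories: Iterable[str]) -> dict:
    """Parse a judge reply; raise ValueError if it violates the visible schema."""
    allowed = set(allowed_categories)
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")

    category = reply.get("category")
    if not isinstance(category, str) or category not in allowed:
        raise ValueError(f"category {category!r} is not allowed")

    secondary = reply.get("secondary_categories", [])
    if not isinstance(secondary, list) or any(
        not isinstance(c, str) or c not in allowed for c in secondary
    ):
        raise ValueError("secondary_categories must list allowed categories only")

    severity = reply.get("severity")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if (
        isinstance(severity, bool)
        or not isinstance(severity, int)
        or not 0 <= severity <= 100
    ):
        raise ValueError("severity must be an integer between 0 and 100")

    for field in CHECKED_TEXT_FIELDS:
        if not isinstance(reply.get(field), str):
            raise ValueError(f"{field} must be a string")
    return reply


# Hypothetical category labels and reply, used only for illustration.
EXAMPLE_CATEGORIES = ["logical_drift", "visual_reinterpretation"]
example_reply = json.dumps({
    "category": "logical_drift",
    "secondary_categories": [],
    "severity": 40,
    "went_wrong": "the suffix revised a correct intermediate result",
    "evidence": "paraphrase of the added trace",
})
parsed = parse_judge_reply(example_reply, EXAMPLE_CATEGORIES)
```

Raising on any violation, rather than silently repairing the reply, keeps malformed judge outputs out of downstream failure-mode counts; whether the original pipeline retries or discards such replies is not stated in the fragment.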

[1] Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, et al. Intern-s1: A scientific multimodal foundation model. arXiv:2508.15763, 2025

[2] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? NeurIPS, 2024

[3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv:2107.03374, 2021

[4] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? on the overthinking of o1-like llms. ICML, 2025

[5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv:2110.14168, 2021

[6] Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks. arXiv:2502.08235, 2025

[7] Muzhi Dai, Chenxu Yang, and Qingyi Si. S-grpo: Early exit via reinforcement learning in reasoning models. NeurIPS, 2025

[8] Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, et al. Efficiently scaling llm reasoning with certaindex. NeurIPS, 2025

[9] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Nature, 2025

[10] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

[11] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv:2412.16720, 2024

[12] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016

[13] Juncai Li, Ru Li, Yuxiang Zhou, Boxiang Ma, and Jeff Z Pan. Chain of thought compression: A theoritical analysis. arXiv preprint arXiv:2601.21576, 2026

[14] Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Haoquan Zhang, Wang Bill Zhu, and Kaipeng Zhang. To think or not to think: A study of thinking in rule-based visual reinforcement fine-tuning. In NeurIPS, 2025

[15] Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, and Kaiyang Zhou. Learning to think fast and slow for visual language models. arXiv:2511.16670, 2025

[16] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. NeurIPS, 2023

[17] Wanlong Liu, Junxiao Xu, Fei Yu, Yukang Lin, Ke Ji, Wenyu Chen, Yan Xu, Yasheng Wang, Lifeng Shang, and Benyou Wang. Qfft, question-free fine-tuning for adaptive reasoning. NeurIPS, 2025

[18] Yue Liu, Jiaying Wu, Yufei He, Ruihan Gong, Jun Xia, Liang Li, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, et al. Efficient inference for large reasoning models: A survey. arXiv:2503.23077, 2025

[19] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. ICLR, 2023

[20] Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. arXiv:2504.09858, 2025

[21] Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 1975

[23] Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv:2503.07365, 2025

[24] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In EMNLP, 2025

[25] David Rein, Betty Hou, Amos Stock, William Liu, Ayan Mandlekar, Arian Ghodsi, Dara Bahri, Fan Zhou, Akshay Mehra, Eunice Yiu, et al. Gpqa: A graduate-level google-proof q&a benchmark. COLM, 2023

[26] Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, and Shiguo Lian. Dast: Difficulty-adaptive slow-thinking for large reasoning models. In EMNLP, pages 2322–2331, 2025

[27] Charles Spearman. The proof and measurement of association between two things. 1961

[28] Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv:2503.16419, 2025

[29] Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. In Findings-ACL 2025, 2025

[30] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. NeurIPS, 2025

[31] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. NeurIPS, 2024

[32] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv:2409.12191, 2024

[33] Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, and Kan Li. Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning. In Findings-NAACL, 2025

[35] Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, and He He. Is it thinking or cheating? detecting implicit reward hacking by measuring reasoning effort. In ICLR, 2026

[36] Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. In NeurIPS, 2025

[37] Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. NeurIPS, 2025

[38] Yuyang Wu, Yifei Wang, Ziyu Ye, Tianqi Du, Stefanie Jegelka, and Yisen Wang. When more is less: Understanding chain-of-thought length in llms. ICLR, 2026
