Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning
Pith reviewed 2026-05-10 06:09 UTC · model grok-4.3
The pith
LLMs self-improve reasoning by consolidating metacognitive experience into reusable knowledge
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that structuring problem solving into reasoning, monitoring, and control roles creates attributable meta-level traces. A hierarchical, multi-timescale consolidation mechanism then turns these traces into evolving meta-knowledge. This produces consistent performance gains across benchmarks and backbone models, with the gains increasing as metacognitive experience accumulates.
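The three-role structure can be illustrated with a minimal, hypothetical sketch. The role names, trace fields, and stand-in heuristics below are assumptions for illustration only, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MetaTrace:
    """Attributable record of one reasoning episode: (role, content) pairs."""
    steps: list = field(default_factory=list)

def solve(problem: str, trace: MetaTrace) -> str:
    # Reasoning role: propose a candidate solution (stand-in heuristic).
    candidate = problem.upper()
    trace.steps.append(("reason", candidate))

    # Monitoring role: assess the candidate and emit a confidence signal.
    confident = len(candidate) > 3
    trace.steps.append(("monitor", f"confident={confident}"))

    # Control role: decide whether to accept the answer or trigger another pass.
    action = "accept" if confident else "retry"
    trace.steps.append(("control", action))
    return candidate

trace = MetaTrace()
answer = solve("prove x", trace)
roles = [role for role, _ in trace.steps]
print(roles)  # every step in the trace is attributable to one of the three roles
```

The point of the decomposition, on this reading, is that each trace step carries a role label, so later consolidation can attribute successes and failures to reasoning, monitoring, or control decisions.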
What carries the argument
The three-role structure for generating meta-level traces and the hierarchical multi-timescale consolidation mechanism that forms reusable meta-knowledge.
Load-bearing premise
The meta-level traces must contain knowledge that remains useful and accurate when consolidated across different problems without losing essential details or adding errors.
What would settle it
A demonstration that performance plateaus or degrades after many consolidation cycles on a fixed set of reasoning problems would falsify the claim that accumulating meta-knowledge drives improvement.
Original abstract
Large language models (LLMs) have demonstrated strong reasoning capabilities, and as existing approaches for enhancing LLM reasoning continue to mature, increasing attention has shifted toward meta-reasoning as a promising direction for further improvement. However, most existing meta-reasoning methods remain episodic: they focus on executing complex meta-reasoning routines within individual instances, but ignore the accumulation of reusable meta-reasoning skills across instances, leading to recurring failure modes and repeatedly high metacognitive effort. In this paper, we introduce Metacognitive Consolidation, a novel framework in which a model consolidates metacognitive experience from past reasoning episodes into reusable knowledge that improves future meta-reasoning. We instantiate this framework by structuring instance-level problem solving into distinct roles for reasoning, monitoring, and control to generate rich, attributable meta-level traces. These traces are then consolidated through a hierarchical, multi-timescale update mechanism that gradually forms evolving meta-knowledge. Experimental results demonstrate consistent performance gains across benchmarks and backbone models, and show that performance improves as metacognitive experience accumulates over time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Metacognitive Consolidation, a framework in which LLMs structure instance-level problem solving into distinct reasoning, monitoring, and control roles to generate attributable meta-level traces. These traces are consolidated via a hierarchical multi-timescale update mechanism that forms evolving reusable meta-knowledge. The central claim is that this produces consistent performance gains across benchmarks and backbone models, with further improvement as metacognitive experience accumulates over time.
Significance. If the results hold, the work offers a principled shift from episodic meta-reasoning to cumulative self-improvement in LLMs, potentially mitigating recurring failure modes and reducing repeated metacognitive effort. The three-role trace generation combined with hierarchical consolidation is a distinctive contribution that could influence future designs for long-term meta-cognitive development.
major comments (2)
- [Experiments] Experimental results section: the claim that performance improves as metacognitive experience accumulates is load-bearing for the paper's contribution, yet the provided description supplies no quantitative metrics, baselines, statistical tests, or ablations (e.g., consolidation vs. simple trace accumulation or repeated prompting). Without these, it is impossible to confirm that gains derive from the proposed mechanism rather than increased context length or other confounds.
- [Method] Method section on the hierarchical multi-timescale update: the assertion that meta-level traces contain reusable, attributable knowledge that can be consolidated without noise or loss of task-specific details requires explicit specification (e.g., update equations or pseudocode) and controls showing preservation of structure. This assumption underpins the accumulation benefit and is not yet anchored by evidence.
minor comments (1)
- [Abstract] Abstract: adding one sentence naming the benchmarks and backbone models used would make the performance claims more concrete for readers.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of Metacognitive Consolidation as a shift toward cumulative self-improvement in LLMs. We address each major comment below and will incorporate revisions to strengthen the empirical and methodological grounding of the claims.
Point-by-point responses
Referee: [Experiments] Experimental results section: the claim that performance improves as metacognitive experience accumulates is load-bearing for the paper's contribution, yet the provided description supplies no quantitative metrics, baselines, statistical tests, or ablations (e.g., consolidation vs. simple trace accumulation or repeated prompting). Without these, it is impossible to confirm that gains derive from the proposed mechanism rather than increased context length or other confounds.
Authors: We agree that the accumulation claim requires stronger quantitative support to rule out confounds. The current experiments demonstrate consistent gains across benchmarks as experience accumulates, but we acknowledge the need for explicit metrics, baselines, and controls. In the revision we will expand the experimental section with: mean performance and standard deviations over multiple runs; paired statistical tests with p-values; direct baselines including simple trace accumulation (without hierarchical updates) and repeated prompting (without consolidation); and context-length-matched controls. These additions will isolate the contribution of the multi-timescale consolidation mechanism. (Revision: yes)
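The promised paired statistical testing could, for instance, take the form of a paired permutation test on per-problem correctness. The sketch below uses synthetic data and is not drawn from the paper's experiments:

```python
import random

def paired_permutation_test(a, b, iters=10000, seed=0):
    """Two-sided p-value for the mean paired difference a[i] - b[i]."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(iters):
        # Under the null, each paired difference is equally likely to have
        # either sign, so randomly flip signs and compare means.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= abs(observed):
            hits += 1
    return hits / iters

# Synthetic per-problem correctness (1 = solved): method vs. baseline.
method   = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
baseline = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0]
p = paired_permutation_test(method, baseline)
print(f"p = {p:.4f}")
```

Pairing by problem is the relevant design choice here: it controls for per-problem difficulty, which a pooled comparison of accuracies would not.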
Referee: [Method] Method section on the hierarchical multi-timescale update: the assertion that meta-level traces contain reusable, attributable knowledge that can be consolidated without noise or loss of task-specific details requires explicit specification (e.g., update equations or pseudocode) and controls showing preservation of structure. This assumption underpins the accumulation benefit and is not yet anchored by evidence.
Authors: We accept that the hierarchical update mechanism would benefit from greater formality and supporting evidence. The method section describes the three-role trace generation and multi-timescale consolidation, but lacks explicit equations. In the revised manuscript we will add update equations and pseudocode for the short-, medium-, and long-term consolidation steps. We will also include a new analysis showing preservation of task-specific structure, for example by reporting similarity metrics between original traces and consolidated knowledge and by evaluating downstream performance on tasks that require retention of instance-specific details. (Revision: yes)
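Since the paper's update equations are not reproduced here, the multi-timescale idea can only be sketched under an assumption: the sketch below uses exponential-moving-average updates with one rate per timescale, which is illustrative rather than the authors' actual rule:

```python
# Assumed per-timescale learning rates: fast, medium, and slow stores.
RATES = {"short": 0.5, "medium": 0.1, "long": 0.01}

def consolidate(memory: dict, signal: float) -> dict:
    """Blend one episode's meta-level signal into each timescale's store:
    m_t = (1 - eta) * m_{t-1} + eta * signal, with timescale-specific eta."""
    return {
        scale: (1 - eta) * memory[scale] + eta * signal
        for scale, eta in RATES.items()
    }

memory = {"short": 0.0, "medium": 0.0, "long": 0.0}
for signal in [1.0, 1.0, 1.0, 1.0]:  # four identical episode signals
    memory = consolidate(memory, signal)

# The fast store tracks recent experience; the slow store changes little,
# which is the structure the referee asks the authors to make explicit.
print(memory)
```

Explicit equations of this kind would also make the referee's preservation concern testable, since one could measure how much instance-specific detail survives each blending step.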
Circularity Check
No circularity: the framework and the reported gains are presented as independent, with gains supported by experiment rather than derived from the framework's own definitions.
full rationale
The paper introduces Metacognitive Consolidation as a new framework that structures reasoning into roles to produce traces and applies a hierarchical multi-timescale update to form meta-knowledge. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the abstract or described structure. The central claim of accumulating performance gains is supported by experimental results across benchmarks rather than any derivation that reduces to its own inputs by construction. The mechanism is presented as an external addition to existing meta-reasoning, with no load-bearing step that renames or tautologically re-derives the observed improvement.
Reference graph
Works this paper leans on
- [30] Rakefet Ackerman and Valerie A. Thompson. 2017. Meta-reasoning: Monitoring and control of thinking and reasoning. Trends in Cognitive Sciences, 21(8):607–617. https://doi.org/10.1016/j.tics.2017.05.004
- [31] John R. Anderson. 1982. Acquisition of cognitive skill. Psychological Review, 89(4):369.
- [32] Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. 2025. Nested Learning: The Illusion of Deep Learning Architectures. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=nbMeRvNb7A
- [33] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. 2025. Titans: Learning to Memorize at Test Time. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=8GjSf9Rh7Z
- [34] Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. 2023. TheoremQA: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7889–7901, Singapore. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.489
- [35] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
- [37] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [38] Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Z.y. Peng, Zhaoxiang Zhang, Zhicheng Zheng, Wenbo Su, and Bo Zheng. 2025. Can large language models detect errors in long chain-of-thought reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.acl-long.905
- [39] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=7Bywt2mQsCe
- [40] Qin Liu, Wenxuan Zhou, Nan Xu, James Y. Huang, Fei Wang, Sheng Zhang, Hoifung Poon, and Muhao Chen. 2025. MetaScale: Test-time scaling with evolving meta-thoughts. Preprint, arXiv:2503.13447.
- [41] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=S37hOerQLB
- [42] OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, and 243 others. 2024. OpenAI o1 System Card. Preprint, arXiv:2412.16720.
- [43] Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. 2025. ReasoningBank: Scaling agent self-evolving with reasoning memory. Preprint, arXiv:2509.25140.
- [44] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=vAElhFcKW6
- [45] Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2025. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=4FWAwZtd2n
- [46] Yuan Sui, Yufei He, Tri Cao, Simeng Han, Yulin Chen, and Bryan Hooi. 2025. Meta-Reasoner: Dynamic guidance for optimized inference-time reasoning in large language models. Preprint, arXiv:2502.19918.
- [48] Hexiang Tan, Fei Sun, Sha Liu, Du Su, Qi Cao, Xin Chen, Jingang Wang, Xunliang Cai, Yuanzhuo Wang, Huawei Shen, and Xueqi Cheng. 2025. Too consistent to detect: A study of self-consistent errors in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4755–…. https://doi.org/10.18653/v1/2025.emnlp-main.238
- [49] Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, and Ying Wen. 2025. ReMA: Learning to meta-think for LLMs with multi-agent reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=ur295YVtmt
- [50] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=1PL1NIMMrw
- [51] Yuqing Wang and Yun Zhao. 2024. Metacognitive prompting improves understanding in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1914–1926, Mexico City, Mexico. https://doi.org/10.18653/v1/2024.naacl-long.106
- [52] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems. https://openreview.net/forum?id=_VjQlMeSB_J
- [53] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. 2025. Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=VNckp7JEHn
- [55] Hanqi Yan, Linhai Zhang, Jiazheng Li, Zhenyi Shen, and Yulan He. 2025. Position: LLMs need a Bayesian meta-reasoning framework for more robust and generalizable reasoning. In Forty-second International Conference on Machine Learning Position Paper Track. https://openreview.net/forum?id=RrvhbxO2hd
- [57] Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, and Bin Cui. 2024. Buffer of Thoughts: Thought-augmented reasoning with large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=ANO1i9JPtb
- [58] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023. Tree of Thoughts: Deliberate problem solving with large language models. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=5Xc1ecxO1h
- [60] Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, and Mengdi Wang. 2025. ReasonFlux-PRM: Trajectory-aware PRMs for long chain-of-thought reasoning in LLMs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=f3sZjkQbv2