PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning
Pith reviewed 2026-05-25 05:20 UTC · model grok-4.3
The pith
PathCal uses reflection-marker distributions to intervene only at uncertain reasoning states and shorten outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PathCal is a training-free decoding controller that distinguishes marker types and intervenes only at locally uncertain states: at each step it uses the distribution over reflection markers to estimate competition between the current trajectory and a competing branch, then softly rebalances marker logits when competing-branch evidence becomes excessive, yielding a better efficiency-performance trade-off across six reasoning benchmarks.
What carries the argument
PathCal controller, which estimates local competition from the reflection-marker distribution and rebalances logits only when competing-branch evidence grows excessive.
If this is right
- Accuracy is preserved or improved while generation length decreases on six reasoning benchmarks.
- Intervention occurs only before the model settles into a stable trajectory.
- Different marker classes produce distinct effects on accuracy versus length.
- No external verifiers or additional sampling steps are required.
- The method works as a lightweight addition to existing decoding.
Where Pith is reading between the lines
- The same marker-distribution signal might be usable to detect other forms of local uncertainty beyond explicit reflection tokens.
- If marker probabilities prove diagnostic in one family of models, similar hesitation signals could be mined in models that lack explicit markers.
- PathCal-style local rebalancing could be combined with existing length-penalty or early-exit heuristics to produce further efficiency gains.
Load-bearing premise
The distribution over reflection markers at each decoding step supplies a reliable estimate of local competition between the current trajectory and any competing branch.
What would settle it
Running PathCal on the same six benchmarks and finding that average generation length increases or accuracy drops would falsify the central claim.
Original abstract
The emergence of Large Reasoning Language Models (LRMs) has paved the way for tackling complex reasoning tasks through test-time scaling by generating long-form Chain-of-Thought (CoT) trajectories during inference. Meanwhile, these trajectories often contain explicit reflection markers such as "wait", "but", and "alternatively", signaling hesitation, revision, and the consideration of alternative explorations, respectively. Recent studies on test-time control leverage such markers as lightweight handles for steering reasoning, typically treating them as a single coarse-grained category rather than distinguishing their distinct functional roles. In this paper, we conduct type-wise suppression and fixed-prefix intervention, revealing that reflection markers differ not only in their functional roles but also in when they exert the greatest influence. Specifically, different marker classes affect accuracy and generation length in distinct ways, and marker choices are most consequential before the model settles into a stable reasoning trajectory. Motivated by these findings, we introduce PathCal, a novel training-free decoding controller that calibrates reasoning paths by distinguishing marker types and intervening only at locally uncertain states. At each decoding step, PathCal utilizes the distribution over reflection markers to estimate local competition between maintaining the current reasoning trajectory and initiating a competing branch, and softly rebalances marker logits when competing-branch evidence becomes excessive. Experiments across six reasoning benchmarks demonstrate that PathCal achieves a better efficiency-performance trade-off, improving or preserving accuracy while reducing generation length, without relying on external verifiers or additional sampling.
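The per-step behaviour described in the abstract (read the marker distribution, compare the two sides, and softly adjust marker logits) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `rebalance_step`, the split of markers into `continue_ids` and `branch_ids`, the `margin` trigger, and the fixed `shift` penalty are all hypothetical stand-ins, since the page does not give the actual threshold or rebalancing function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def rebalance_step(logits, continue_ids, branch_ids, margin=0.05, shift=1.0):
    """Hypothetical single decoding step of a marker-rebalancing controller.

    continue_ids / branch_ids: token ids assumed (for illustration only) to mark
    "stay on the current trajectory" versus "start a competing branch".
    Returns the possibly adjusted logits and the two probability masses.
    """
    probs = softmax(logits)
    c_mass = sum(probs[i] for i in continue_ids)
    b_mass = sum(probs[i] for i in branch_ids)
    out = list(logits)
    # Intervene only when branch evidence exceeds continue evidence by the margin.
    if b_mass - c_mass > margin:
        for i in branch_ids:
            # Soft penalty: lower the logit instead of masking the token out.
            out[i] -= shift
    return out, c_mass, b_mass


if __name__ == "__main__":
    adjusted, c, b = rebalance_step([0.0, 0.0, 2.0, 0.0], [1], [2])
    print(adjusted, round(c, 3), round(b, 3))
```

Because the penalty is subtracted rather than set to negative infinity, a branch marker can still be sampled when the model strongly prefers it, which is one plausible reading of "softly rebalances".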
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PathCal, a training-free decoding controller for large reasoning models (LRMs) that distinguishes functional roles of reflection markers (e.g., 'wait', 'but', 'alternatively') via type-wise suppression and fixed-prefix experiments. It intervenes only at locally uncertain states by using the marker distribution to estimate competition between the current trajectory and competing branches, then softly rebalances logits when competing-branch probability is excessive. Experiments on six reasoning benchmarks are reported to show improved or preserved accuracy with reduced generation length, without external verifiers or additional sampling.
Significance. If the empirical claims hold, the work is significant for providing a lightweight, parameter-free, distribution-driven method to improve the efficiency-performance trade-off in test-time scaling of LRMs. The type-wise analysis of markers and the focus on local intervention without external components are strengths that distinguish it from prior test-time control approaches.
major comments (2)
- [Results section] Results section: the central empirical claim of a better efficiency-performance trade-off rests on reported outcomes across six benchmarks, yet the manuscript supplies no baselines, error bars, exclusion rules, or statistical tests; this prevents assessment of whether improvements reflect post-hoc selection or fitting artifacts rather than the proposed local intervention.
- [Method section] Method section (PathCal description): the claim that marker distribution estimates 'local competition' between trajectories is load-bearing for the intervention rule, but the manuscript does not specify the exact threshold or rebalancing function, leaving open whether the heuristic reduces to a quantity defined from the same data.
minor comments (2)
- [Abstract] Abstract: the six benchmarks are not named; listing them would help readers immediately gauge the scope of the evaluation.
- [Introduction] Notation: the terms 'locally uncertain states' and 'competing-branch evidence' are used without an explicit definition or equation in the early sections, which could be clarified for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback on our manuscript. We address each major comment below.
- Referee: [Results section] Results section: the central empirical claim of a better efficiency-performance trade-off rests on reported outcomes across six benchmarks, yet the manuscript supplies no baselines, error bars, exclusion rules, or statistical tests; this prevents assessment of whether improvements reflect post-hoc selection or fitting artifacts rather than the proposed local intervention.
Authors: We acknowledge the referee's concern regarding the presentation of results. The manuscript reports performance across six benchmarks but indeed lacks error bars, statistical tests, and explicit exclusion rules. We will update the results section to include error bars from multiple runs, specify baselines more clearly, and incorporate statistical significance tests to better substantiate the efficiency-performance trade-offs. (Revision: yes)
- Referee: [Method section] Method section (PathCal description): the claim that marker distribution estimates 'local competition' between trajectories is load-bearing for the intervention rule, but the manuscript does not specify the exact threshold or rebalancing function, leaving open whether the heuristic reduces to a quantity defined from the same data.
Authors: We thank the referee for highlighting this issue. The description of PathCal indicates that the marker distribution is used to estimate local competition and that logits are softly rebalanced when competing-branch evidence is excessive. To address the lack of specificity, we will provide the exact threshold value and the mathematical form of the rebalancing function in the revised method section, ensuring the intervention rule is fully specified and reproducible. (Revision: yes)
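One way the error bars and significance checks discussed in the first response could be produced is a paired bootstrap over benchmark items. The Python sketch below is only an illustration of that general technique, not a procedure taken from the manuscript; the function name `paired_bootstrap_ci` and its defaults are hypothetical, and the inputs are assumed to be per-item 0/1 correctness flags for a baseline and for the method on the same items.

```python
import random


def paired_bootstrap_ci(base_correct, method_correct, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the accuracy difference (method - baseline).

    Both inputs are per-item 0/1 correctness lists over the same benchmark items.
    """
    if len(base_correct) != len(method_correct) or not base_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(base_correct)
    diffs = []
    for _ in range(n_boot):
        # Resample items with replacement, keeping each baseline/method pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(method_correct[i] - base_correct[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((1 - level) / 2 * n_boot)]
    hi = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    base = [0, 1, 0, 1, 0, 1, 0, 0]
    meth = [1, 1, 0, 1, 1, 1, 0, 1]
    print(paired_bootstrap_ci(base, meth))
```

An interval that excludes zero would suggest the accuracy difference is not a resampling artifact; the same resampling can be applied to per-item generation lengths to put an interval on the length reduction.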
Circularity Check
No significant circularity detected
The provided abstract and description contain no equations, fitted parameters, or self-citations that reduce any claimed prediction or result to its own inputs by construction. PathCal is presented as a training-free heuristic that uses observed marker distributions for local intervention, with the efficiency-performance trade-off supported by direct benchmark experiments rather than any definitional or fitted equivalence. The derivation chain relies on empirical type-wise suppression findings that are independent of the final controller outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Reflection markers differ in functional roles and exert greatest influence before the model settles into a stable trajectory.
- domain assumption: Marker logit distributions reliably indicate local competition between current trajectory and competing branches.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
g_t = 4 C_t B_t / (C_t + B_t)^2 + ε ... α_t = α_base g_t min{[B_t − C_t + γ]_+ / τ, 1}
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
PATHCAL uses marker probabilities to detect local reasoning-mode competition and softly rebalances marker logits when competing-branch evidence becomes excessive
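The gate quoted in the first entry above can be read as the product of a balance term, which peaks when the two masses are equal, and a clipped excess term, which is nonzero only when B_t exceeds C_t minus γ. The Python sketch below works through that reading under assumptions the page does not confirm: C_t and B_t are taken to be the probability masses on current-trajectory and competing-branch markers, ε is placed in the denominator as a stabiliser (the quoted passage is ambiguous and partly elided), and the function name `gate_strength` and all default values are hypothetical.

```python
def gate_strength(c_t, b_t, alpha_base=1.0, gamma=0.0, tau=1.0, eps=1e-8):
    """Return (g_t, alpha_t) for assumed marker masses c_t and b_t.

    g_t     = 4 * C * B / ((C + B)^2 + eps)   # eps placement is an assumption
    alpha_t = alpha_base * g_t * min(max(B - C + gamma, 0) / tau, 1)
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Balance term: close to 1 when c_t == b_t, near 0 when one side dominates.
    g_t = 4.0 * c_t * b_t / ((c_t + b_t) ** 2 + eps)
    # Excess term: hinge on branch-minus-continue mass, scaled by tau, clipped at 1.
    excess = min(max(b_t - c_t + gamma, 0.0) / tau, 1.0)
    return g_t, alpha_base * g_t * excess


if __name__ == "__main__":
    for c, b in [(0.3, 0.3), (0.2, 0.6), (0.6, 0.2)]:
        print(c, b, gate_strength(c, b, tau=0.5))
```

Under this reading the strength is zero whenever the branch mass does not exceed the continue mass (with γ = 0), and it is also damped when one side overwhelmingly dominates, so the largest adjustments fall in contested states where the branch side is moderately ahead.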
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A. Complete Experimental Setup
Models. We evaluate four open-source reasoning language models that span scales, backbones, and distillation pipelines: DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-14...