PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning

Dengzhe Hou; Fangzhou Lin; Kazunori Yamada; Lingyu Jiang; Peiran Li; Shuo Xing; Tsubasa Takahashi; Zhengzhong Tu; Zirui Li

arxiv: 2605.23074 · v1 · pith:X2CEXLG7new · submitted 2026-05-21 · 💻 cs.AI

PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning

Lingyu Jiang , Zirui Li , Shuo Xing , Peiran Li , Tsubasa Takahashi , Dengzhe Hou , Zhengzhong Tu , Kazunori Yamada

show 1 more author

Fangzhou Lin

This is my paper

Pith reviewed 2026-05-25 05:20 UTC · model grok-4.3

classification 💻 cs.AI

keywords reflection markerschain-of-thoughtdecoding controllarge reasoning modelstest-time scalinguncertainty estimationreasoning efficiency

0 comments

The pith

PathCal uses reflection-marker distributions to intervene only at uncertain reasoning states and shorten outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reflection markers such as 'wait' and 'but' play distinct functional roles and exert their strongest influence before a reasoning trajectory stabilizes. It shows through suppression and prefix experiments that treating markers as a single category misses these differences. PathCal therefore monitors the marker distribution at each step to detect when evidence for a competing branch grows excessive, then softly rebalances logits only at those locally uncertain points. The result is shorter generations that still reach correct answers on complex tasks. This approach operates without training, external verifiers, or extra sampling.

Core claim

PathCal is a training-free decoding controller that distinguishes marker types and intervenes only at locally uncertain states: at each step it uses the distribution over reflection markers to estimate competition between the current trajectory and a competing branch, then softly rebalances marker logits when competing-branch evidence becomes excessive, yielding a better efficiency-performance trade-off across six reasoning benchmarks.

What carries the argument

PathCal controller, which estimates local competition from the reflection-marker distribution and rebalances logits only when competing-branch evidence grows excessive.

If this is right

Accuracy is preserved or improved while generation length decreases on six reasoning benchmarks.
Intervention occurs only before the model settles into a stable trajectory.
Different marker classes produce distinct effects on accuracy versus length.
No external verifiers or additional sampling steps are required.
The method works as a lightweight addition to existing decoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same marker-distribution signal might be usable to detect other forms of local uncertainty beyond explicit reflection tokens.
If marker probabilities prove diagnostic in one family of models, similar hesitation signals could be mined in models that lack explicit markers.
PathCal-style local rebalancing could be combined with existing length-penalty or early-exit heuristics to produce further efficiency gains.

Load-bearing premise

The distribution over reflection markers at each decoding step supplies a reliable estimate of local competition between the current trajectory and any competing branch.

What would settle it

Running PathCal on the same six benchmarks and finding that average generation length increases or accuracy drops would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.23074 by Dengzhe Hou, Fangzhou Lin, Kazunori Yamada, Lingyu Jiang, Peiran Li, Shuo Xing, Tsubasa Takahashi, Zhengzhong Tu, Zirui Li.

**Figure 1.** Figure 1: Illustration of our proposed method, PATHCAL. PATHCAL calibrates local reasoning-path choices by softly reweighting continuation and competing-branch markers when the current trajectory is at risk of unnecessary switching. This assumption is questionable: markers such as “so”, “but”, and “wait” appear to express distinct reasoning transitions [3]. This raises a basic question: Are reflection markers actual… view at source ↗

**Figure 2.** Figure 2: Type-wise suppression on AIME2025 using DeepSeekR1-Distill-Qwen-7B. ility at the decoding level by selectively suppressing different marker classes. If reflection markers formed a homogeneous control class, then suppressing different marker types should produce similar directional effects on generation behavior, such as comparable changes in accuracy and length. In addition, suppressing all markers shou… view at source ↗

**Figure 3.** Figure 3: Fixed-prefix branch intervention. With the same input and [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Accuracy vs. generation length on THQA. Colors indicate methods and markers indicate model families [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Hyperparameter sensitivity of PATHCAL on MATH500. Left: sensitivity to intervention strength α. Right: sensitivity to alternativemarker weight λA. The red star marks the default configuration. Hyperparameter sensitivity [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

The emergence of Large Reasoning Language Models (LRMs) has paved the way for tackling complex reasoning tasks through test-time scaling by generating long-form Chain-of-Thought (CoT) trajectories during inference. Meanwhile, these trajectories often contain explicit reflection markers such as ``wait'', ``but'', and ``alternatively'', signaling hesitation, revision, and the consideration of alternative explorations, respectively. Recent studies on test-time control leverage such markers as lightweight handles for steering reasoning, typically treating them as a single coarse-grained category rather than distinguishing their distinct functional roles. In this paper, we conduct type-wise suppression and fixed-prefix intervention, revealing that reflection markers differ not only in their functional roles but also in when they exert the greatest influence. Specifically, different marker classes affect accuracy and generation length in distinct ways, and marker choices are most consequential before the model settles into a stable reasoning trajectory. Motivated by these findings, we introduce PathCal, a novel training-free decoding controller that calibrates reasoning paths by distinguishing marker types and intervening only at locally uncertain states. At each decoding step, PathCal utilizes the distribution over reflection-markers to estimate local competition between maintaining the current reasoning trajectory and initiating a competing branch, and softly rebalances marker logits when competing-branch evidence becomes excessive. Experiments across six reasoning benchmarks demonstrate that PathCal achieves a better efficiency--performance trade-off, improving or preserving accuracy while reducing generation length, without relying on external verifiers or additional sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PathCal, a training-free decoding controller for large reasoning models (LRMs) that distinguishes functional roles of reflection markers (e.g., 'wait', 'but', 'alternatively') via type-wise suppression and fixed-prefix experiments. It intervenes only at locally uncertain states by using the marker distribution to estimate competition between the current trajectory and competing branches, then softly rebalances logits when competing-branch probability is excessive. Experiments on six reasoning benchmarks are reported to show improved or preserved accuracy with reduced generation length, without external verifiers or additional sampling.

Significance. If the empirical claims hold, the work is significant for providing a lightweight, parameter-free, distribution-driven method to improve the efficiency-performance trade-off in test-time scaling of LRMs. The type-wise analysis of markers and the focus on local intervention without external components are strengths that distinguish it from prior test-time control approaches.

major comments (2)

[Results section] Results section: the central empirical claim of a better efficiency-performance trade-off rests on reported outcomes across six benchmarks, yet the manuscript supplies no baselines, error bars, exclusion rules, or statistical tests; this prevents assessment of whether improvements reflect post-hoc selection or fitting artifacts rather than the proposed local intervention.
[Method section] Method section (PathCal description): the claim that marker distribution estimates 'local competition' between trajectories is load-bearing for the intervention rule, but the manuscript does not specify the exact threshold or rebalancing function, leaving open whether the heuristic reduces to a quantity defined from the same data.

minor comments (2)

[Abstract] Abstract: the list of six benchmarks is not named, which would help readers immediately gauge the scope of the evaluation.
[Introduction] Notation: the terms 'locally uncertain states' and 'competing-branch evidence' are used without an explicit definition or equation in the early sections, which could be clarified for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We address each major comment below.

read point-by-point responses

Referee: [Results section] Results section: the central empirical claim of a better efficiency-performance trade-off rests on reported outcomes across six benchmarks, yet the manuscript supplies no baselines, error bars, exclusion rules, or statistical tests; this prevents assessment of whether improvements reflect post-hoc selection or fitting artifacts rather than the proposed local intervention.

Authors: We acknowledge the referee's concern regarding the presentation of results. The manuscript reports performance across six benchmarks but indeed lacks error bars, statistical tests, and explicit exclusion rules. We will update the results section to include error bars from multiple runs, specify baselines more clearly, and incorporate statistical significance tests to better substantiate the efficiency-performance trade-offs. revision: yes
Referee: [Method section] Method section (PathCal description): the claim that marker distribution estimates 'local competition' between trajectories is load-bearing for the intervention rule, but the manuscript does not specify the exact threshold or rebalancing function, leaving open whether the heuristic reduces to a quantity defined from the same data.

Authors: We thank the referee for highlighting this issue. The description of PathCal indicates that the marker distribution is used to estimate local competition and that logits are softly rebalanced when competing-branch evidence is excessive. To address the lack of specificity, we will provide the exact threshold value and the mathematical form of the rebalancing function in the revised method section, ensuring the intervention rule is fully specified and reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description contain no equations, fitted parameters, or self-citations that reduce any claimed prediction or result to its own inputs by construction. PathCal is presented as a training-free heuristic that uses observed marker distributions for local intervention, with the efficiency-performance trade-off supported by direct benchmark experiments rather than any definitional or fitted equivalence. The derivation chain relies on empirical type-wise suppression findings that are independent of the final controller outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about marker semantics and timing; no free parameters or invented entities are stated in the abstract.

axioms (2)

domain assumption Reflection markers differ in functional roles and exert greatest influence before the model settles into a stable trajectory.
This premise directly motivates the type-wise analysis and the decision to intervene only at locally uncertain states.
domain assumption Marker logit distributions reliably indicate local competition between current trajectory and competing branches.
This underpins the soft rebalancing rule at each decoding step.

pith-pipeline@v0.9.0 · 5819 in / 1237 out tokens · 28287 ms · 2026-05-25T05:20:00.791740+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

g_t = 4 C_t B_t / (C_t + B_t)^2 + ε ... α_t = α_base g_t min{[B_t − C_t + γ]_+ / τ, 1}
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

PATHCAL uses marker probabilities to detect local reasoning-mode competition and softly rebalances marker logits when competing-branch evidence becomes excessive

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 13 canonical work pages · 7 internal anchors

[1]

L1: Controlling how long a reasoning model thinks with reinforcement learning, 2025

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning, 2025

2025
[2]

Aimo validation amc

AI-MO. Aimo validation amc. https://huggingface.co/datasets/AI-MO/ aimo-validation-amc, 2024

2024
[3]

Interpretation of discourse connectives is probabilistic: Evidence from the study of but and although.Discourse Processes, 57(4):376–399, 2020

Fatemeh Torabi Asr and Vera Demberg. Interpretation of discourse connectives is probabilistic: Evidence from the study of but and although.Discourse Processes, 57(4):376–399, 2020

2020
[4]

Aytes, Jinheon Baek, and Sung Ju Hwang

Simon A. Aytes, Jinheon Baek, and Sung Ju Hwang. Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching, 2025

2025
[5]

Math- arena: Evaluating llms on uncontaminated math competitions, 2026

Mislav Balunovi´c, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi´c, and Martin Vechev. Math- arena: Evaluating llms on uncontaminated math competitions, 2026

2026
[6]

Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy

Paul C. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter?, 2025

2025
[7]

Le, Christopher Ré, and Azalia Mirhoseini

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024

2024
[8]

Unveiling the latent directions of reflection in large language models, 2025

Fu-Chieh Chang, Yu-Ting Lee, and Pei-Yuan Wu. Unveiling the latent directions of reflection in large language models, 2025

2025
[9]

Directional reasoning trajectory change (drtc): Identifying critical trace segments in reasoning models, 2026

Waldemar Chang. Directional reasoning trajectory change (drtc): Identifying critical trace segments in reasoning models, 2026

2026
[10]

TheoremQA: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. TheoremQA: A theorem-driven question answering dataset. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7889–7901, Singapore, December 2023. Association for C...

2023
[11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

A language anchor-guided method for robust noisy domain generalization.arXiv preprint arXiv:2503.17211, 2025

Zilin Dai, Lehong Wang, Fangzhou Lin, Yidong Wang, Zhigang Li, Kazunori D Yamada, Ziming Zhang, and Wang Lu. A language anchor-guided method for robust noisy domain generalization.arXiv preprint arXiv:2503.17211, 2025

work page arXiv 2025
[13]

Do thinking tokens help or trap? towards more efficient large reasoning model, 2025

Bowen Ding, Yuhan Chen, Futing Wang, Lingfeng Ming, and Tao Lin. Do thinking tokens help or trap? towards more efficient large reasoning model, 2025

2025
[14]

Hero, and Sijia Liu

Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred O. Hero, and Sijia Liu. Cyclicreflex: Improving reasoning models via cyclical reflection token scheduling. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[15]

Alphazero-like tree-search can guide large language model decoding and training, 2024

Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024

2024
[16]

What charac- terizes effective reasoning? revisiting length, review, and structure of cot, 2025

Yunzhen Feng, Julia Kempe, Cheng Zhang, Parag Jain, and Anthony Hartshorn. What charac- terizes effective reasoning? revisiting length, review, and structure of cot, 2025

2025
[17]

Efficiently scaling llm reasoning with certaindex, 2025

Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, and Hao Zhang. Efficiently scaling llm reasoning with certaindex, 2025

2025
[18]

Rogov, Elena Tutubalina, and Ivan Oseledets

Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y . Rogov, Elena Tutubalina, and Ivan Oseledets. I have covered all the bases here: Interpreting reasoning features in large language models via sparse autoencoders, 2025. 10

2025
[19]

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D. Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars, 2025

2025
[20]

The llama 3 herd of models, 2024

Aaron Grattafiori et al. The llama 3 herd of models, 2024

2024
[21]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Token-budget-aware llm reasoning, 2025

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware llm reasoning, 2025

2025
[23]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

2021
[24]

WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

Dengzhe Hou, Lingyu Jiang, Deng Li, Zirui Li, Fangzhou Lin, and Kazunori D Yamada. Wmf- am: Probing llm working memory via depth-parameterized cumulative state tracking.arXiv preprint arXiv:2603.27343, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Shijue Huang, Hongru Wang, Wanjun Zhong, Zhaochen Su, Jiazhan Feng, Bowen Cao, and Yi R. Fung. Adactrl: Towards adaptive and controllable reasoning via difficulty-aware budgeting, 2025

2025
[26]

TimePre: Bridging Accuracy, Efficiency, and Stability in Probabilistic Time-Series Forecasting

Lingyu Jiang, Lingyu Xu, Peiran Li, Qianwen Ge, Dingyi Zhuang, Shuo Xing, Wenjing Chen, Xiangbo Gao, Ting-Hsuan Chen, Xueying Zhan, et al. Timepre: Bridging accuracy, efficiency, and stability in probabilistic time-series forecasting.arXiv preprint arXiv:2511.18539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

First try matters: Revisiting the role of reflection in reasoning models, 2025

Liwei Kang, Yue Deng, Yao Xiao, Zhanfeng Mo, Wee Sun Lee, and Lidong Bing. First try matters: Revisiting the role of reflection in reasoning models, 2025

2025
[28]

Large language models are zero-shot reasoners, 2023

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023

2023
[29]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023

2023
[30]

Bowman, and Ethan Perez

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy ...

2023
[31]

Inference- time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. InThirty-seventh Confer- ence on Neural Information Processing Systems, 2023

2023
[32]

Let the abyss stare back adaptive falsification for autonomous scientific discovery.arXiv preprint arXiv:2603.29045, 2026

Peiran Li, Fangzhou Lin, Shuo Xing, Jiashuo Sun, Dylan Zhang, Siyuan Yang, Chaoqun Ni, and Zhengzhong Tu. Let the abyss stare back adaptive falsification for autonomous scientific discovery.arXiv preprint arXiv:2603.29045, 2026

work page arXiv 2026
[33]

Bibagent: An agentic framework for traceable miscitation detection in scientific literature.arXiv preprint arXiv:2601.16993, 2026

Peiran Li, Fangzhou Lin, Shuo Xing, Xiang Zheng, Xi Hong, Siyuan Yang, Jiashuo Sun, Zhengzhong Tu, and Chaoqun Ni. Bibagent: An agentic framework for traceable miscitation detection in scientific literature.arXiv preprint arXiv:2601.16993, 2026

work page arXiv 2026
[34]

Traversal-as-policy: Log-distilled gated behavior trees as externalized, verifiable policies for safe, robust, and efficient agents.arXiv preprint arXiv:2603.05517, 2026

Peiran Li, Jiashuo Sun, Fangzhou Lin, Shuo Xing, Tianfu Fu, Suofei Feng, Chaoqun Ni, and Zhengzhong Tu. Traversal-as-policy: Log-distilled gated behavior trees as externalized, verifiable policies for safe, robust, and efficient agents.arXiv preprint arXiv:2603.05517, 2026. 11

work page arXiv 2026
[35]

Contrastive decoding: Open-ended text generation as optimization

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Lon...

2023
[36]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024

2024
[37]

Position: Human-centric ai requires a minimum viable level of human understanding.arXiv preprint arXiv:2602.00854, 2026

Fangzhou Lin, Qianwen Ge, Lingyu Xu, Peiran Li, Xiangbo Gao, Shuo Xing, Kazunori Yamada, Ziming Zhang, Haichong Zhang, and Zhengzhong Tu. Position: Human-centric ai requires a minimum viable level of human understanding.arXiv preprint arXiv:2602.00854, 2026

work page arXiv 2026
[38]

AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian Inference

Fangzhou Lin, Peiran Li, Shuo Xing, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, and Zhengzhong Tu. Adaptfuse: Training-free sequential preference learning via externalized bayesian inference.arXiv preprint arXiv:2604.03925, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

Fangzhou Lin, Shuo Xing, Peiran Li, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, and Zhengzhong Tu. Caps: Cascaded adaptive pairwise selection for efficient parallel reasoning.arXiv preprint arXiv:2605.15513, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Cot-valve: Length-compressible chain-of-thought tuning, 2025

Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. Cot-valve: Length-compressible chain-of-thought tuning, 2025

2025
[41]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. InThirty-seventh Conference on Neural Informati...

2023
[42]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, et al. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

introducing-o3-and-o4-mini.OpenAI Blog, 2025

OpenAI. introducing-o3-and-o4-mini.OpenAI Blog, 2025

2025
[44]

Plum: Prompt learning using metaheuristics

Rui Pan, Shuo Xing, Shizhe Diao, Wenhe Sun, Xiang Liu, KaShun Shum, Jipeng Zhang, Renjie Pi, and Tong Zhang. Plum: Prompt learning using metaheuristics. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 2177–2197, Bangkok, Thailand, August 2024. Association for Computationa...

2024
[45]

Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning, 2024

Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning, 2024

2024
[46]

Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning, 2025

Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, and Jing Shao. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning, 2025

2025
[47]

Concise: Confidence-guided compression in step-by-step efficient reasoning.Proceedings of EMNLP, 2025

Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, et al. Concise: Confidence-guided compression in step-by-step efficient reasoning.Proceedings of EMNLP, 2025

2025
[48]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

2025
[49]

Qwq: Reflect deeply on the boundaries of the unknown, 2025

Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, 2025. 12

2025
[50]

Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[51]

Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms, 2025

Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms, 2025

2025
[52]

Thinking by subtraction: Confidence-driven contrastive decoding for llm reasoning, 2026

Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin, Bang Yang, and Yuexian Zou. Thinking by subtraction: Confidence-driven contrastive decoding for llm reasoning, 2026

2026
[53]

Kimi k2: Open agentic intelligence, 2026

Kimi Team. Kimi k2: Open agentic intelligence, 2026

2026
[54]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023

2023
[55]

Do large language model benchmarks test reliability?, 2025

Joshua Vendrow, Edward Vendrow, Sara Beery, and Aleksander Madry. Do large language model benchmarks test reliability?, 2025

2025
[56]

Investigating gender bias in language models using causal mediation analysis

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associa...

2020
[57]

Adapthink: Adaptive thinking preferences for reasoning language model, 2025

Xu Wan, Wei Wang, Wenyue Xu, Wotao Yin, Jie Song, and Mingyang Sun. Adapthink: Adaptive thinking preferences for reasoning language model, 2025

2025
[58]

Self-consistency improves chain of thought reasoning in language models.International Conference on Learning Representations (ICLR), 2023

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.International Conference on Learning Representations (ICLR), 2023

2023
[59]

R1-compress: Long chain-of-thought compression via chunk compression and search.arXiv preprint, 2025

Yibo Wang, Haotian Luo, Li Shen, et al. R1-compress: Long chain-of-thought compression via chunk compression and search.arXiv preprint, 2025

2025
[60]

Thoughts are all over the place: On the underthinking of o1-like llms.arXiv preprint arXiv:2501.18585, 2025

Yue Wang et al. Thoughts are all over the place: On the underthinking of o1-like llms.arXiv preprint arXiv:2501.18585, 2025

work page arXiv 2025
[61]

Reasoning-finetuning repurposes latent representations in base models, 2025

Jake Ward, Chuqiao Lin, Constantin Venhoff, and Neel Nanda. Reasoning-finetuning repurposes latent representations in base models, 2025

2025
[62]

Chi, Quoc V Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

2022
[63]

It’s not that simple

Guojun Wu. It’s not that simple. an analysis of simple test-time scaling, 2025

2025
[64]

Tokenskip: Controllable chain-of-thought compression in llms.Proceedings of EMNLP, 2025

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms.Proceedings of EMNLP, 2025

2025
[65]

Chain of draft: Thinking faster by writing less, 2025

Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less, 2025

2025
[66]

A*-thought: Efficient reasoning via bidirectional compression for low-resource settings.arXiv preprint, 2025

Xiaoang Xu, Shuo Wang, Xu Han, et al. A*-thought: Efficient reasoning via bidirectional compression for low-resource settings.arXiv preprint, 2025

2025
[67]

Dynamic early exit in reasoning models, 2025

Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning models, 2025

2025
[68]

Wong, and Di Wang

Shu Yang, Junchao Wu, Xin Chen, Yunze Xiao, Xinyi Yang, Derek F. Wong, and Di Wang. Understanding aha moments: from external observations to internal mechanisms, 2025

2025
[69]

Towards thinking-optimal scaling of test-time compute for llm reasoning, 2025

Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. Towards thinking-optimal scaling of test-time compute for llm reasoning, 2025. 13

2025
[70]

Griffiths, Yuan Cao, and Karthik Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023

2023
[71]

Understanding hyperbolic metric learning through hard negative sampling

Yun Yue, Fangzhou Lin, Guanyi Mou, and Ziming Zhang. Understanding hyperbolic metric learning through hard negative sampling. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1891–1903, 2024

1903
[72]

Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities?, 2025

Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, and Xipeng Qiu. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities?, 2025

2025
[73]

Gps: A probabilistic distributional similarity with gumbel priors for set-to-set matching

Ziming Zhang, Fangzhou Lin, Haotian Liu, Jose Morales, Haichong Zhang, Kazunori Yamada, Vijaya B Kolachalama, and Venkatesh Saligrama. Gps: A probabilistic distributional similarity with gumbel priors for set-to-set matching. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[74]

Deep loss convexification for learning iterative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1501–1513, 2024

Ziming Zhang, Yuping Shao, Yiqing Zhang, Fangzhou Lin, Haichong Zhang, and Elke Runden- steiner. Deep loss convexification for learning iterative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1501–1513, 2024

2024
[75]

the answer is

Jiachen Zhao, Yiyou Sun, Weiyan Shi, and Dawn Song. Can aha moments be fake? identifying true and decorative thinking steps in chain-of-thought, 2026. 14 A Complete Experimental Setup Models.We evaluate four open-source reasoning language models that span scales, backbones, and distillation pipelines:DeepSeek-R1-Distill-Qwen-7B,DeepSeek-R1-Distill-Qwen-14...

2026

[1] [1]

L1: Controlling how long a reasoning model thinks with reinforcement learning, 2025

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning, 2025

2025

[2] [2]

Aimo validation amc

AI-MO. Aimo validation amc. https://huggingface.co/datasets/AI-MO/ aimo-validation-amc, 2024

2024

[3] [3]

Interpretation of discourse connectives is probabilistic: Evidence from the study of but and although.Discourse Processes, 57(4):376–399, 2020

Fatemeh Torabi Asr and Vera Demberg. Interpretation of discourse connectives is probabilistic: Evidence from the study of but and although.Discourse Processes, 57(4):376–399, 2020

2020

[4] [4]

Aytes, Jinheon Baek, and Sung Ju Hwang

Simon A. Aytes, Jinheon Baek, and Sung Ju Hwang. Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching, 2025

2025

[5] [5]

Math- arena: Evaluating llms on uncontaminated math competitions, 2026

Mislav Balunovi´c, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi´c, and Martin Vechev. Math- arena: Evaluating llms on uncontaminated math competitions, 2026

2026

[6] [6]

Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy

Paul C. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter?, 2025

2025

[7] [7]

Le, Christopher Ré, and Azalia Mirhoseini

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024

2024

[8] [8]

Unveiling the latent directions of reflection in large language models, 2025

Fu-Chieh Chang, Yu-Ting Lee, and Pei-Yuan Wu. Unveiling the latent directions of reflection in large language models, 2025

2025

[9] [9]

Directional reasoning trajectory change (drtc): Identifying critical trace segments in reasoning models, 2026

Waldemar Chang. Directional reasoning trajectory change (drtc): Identifying critical trace segments in reasoning models, 2026

2026

[10] [10]

TheoremQA: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. TheoremQA: A theorem-driven question answering dataset. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7889–7901, Singapore, December 2023. Association for C...

2023

[11] [11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[12] [12]

A language anchor-guided method for robust noisy domain generalization.arXiv preprint arXiv:2503.17211, 2025

Zilin Dai, Lehong Wang, Fangzhou Lin, Yidong Wang, Zhigang Li, Kazunori D Yamada, Ziming Zhang, and Wang Lu. A language anchor-guided method for robust noisy domain generalization.arXiv preprint arXiv:2503.17211, 2025

work page arXiv 2025

[13] [13]

Do thinking tokens help or trap? towards more efficient large reasoning model, 2025

Bowen Ding, Yuhan Chen, Futing Wang, Lingfeng Ming, and Tao Lin. Do thinking tokens help or trap? towards more efficient large reasoning model, 2025

2025

[14] [14]

Hero, and Sijia Liu

Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred O. Hero, and Sijia Liu. Cyclicreflex: Improving reasoning models via cyclical reflection token scheduling. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[15] [15]

Alphazero-like tree-search can guide large language model decoding and training, 2024

Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024

2024

[16] [16]

What charac- terizes effective reasoning? revisiting length, review, and structure of cot, 2025

Yunzhen Feng, Julia Kempe, Cheng Zhang, Parag Jain, and Anthony Hartshorn. What charac- terizes effective reasoning? revisiting length, review, and structure of cot, 2025

2025

[17] [17]

Efficiently scaling llm reasoning with certaindex, 2025

Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, and Hao Zhang. Efficiently scaling llm reasoning with certaindex, 2025

2025

[18] [18]

Rogov, Elena Tutubalina, and Ivan Oseledets

Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y . Rogov, Elena Tutubalina, and Ivan Oseledets. I have covered all the bases here: Interpreting reasoning features in large language models via sparse autoencoders, 2025. 10

2025

[19] [19]

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D. Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars, 2025

2025

[20] [20]

The llama 3 herd of models, 2024

Aaron Grattafiori et al. The llama 3 herd of models, 2024

2024

[21] [21]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Token-budget-aware llm reasoning, 2025

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware llm reasoning, 2025

2025

[23] [23]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

2021

[24] [24]

WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

Dengzhe Hou, Lingyu Jiang, Deng Li, Zirui Li, Fangzhou Lin, and Kazunori D Yamada. Wmf- am: Probing llm working memory via depth-parameterized cumulative state tracking.arXiv preprint arXiv:2603.27343, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

Shijue Huang, Hongru Wang, Wanjun Zhong, Zhaochen Su, Jiazhan Feng, Bowen Cao, and Yi R. Fung. Adactrl: Towards adaptive and controllable reasoning via difficulty-aware budgeting, 2025

2025

[26] [26]

TimePre: Bridging Accuracy, Efficiency, and Stability in Probabilistic Time-Series Forecasting

Lingyu Jiang, Lingyu Xu, Peiran Li, Qianwen Ge, Dingyi Zhuang, Shuo Xing, Wenjing Chen, Xiangbo Gao, Ting-Hsuan Chen, Xueying Zhan, et al. Timepre: Bridging accuracy, efficiency, and stability in probabilistic time-series forecasting.arXiv preprint arXiv:2511.18539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

First try matters: Revisiting the role of reflection in reasoning models, 2025

Liwei Kang, Yue Deng, Yao Xiao, Zhanfeng Mo, Wee Sun Lee, and Lidong Bing. First try matters: Revisiting the role of reflection in reasoning models, 2025

2025

[28] [28]

Large language models are zero-shot reasoners, 2023

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023

2023

[29] [29]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023

2023

[30] [30]

Bowman, and Ethan Perez

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy ...

2023

[31] [31]

Inference- time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. InThirty-seventh Confer- ence on Neural Information Processing Systems, 2023

2023

[32] [32]

Let the abyss stare back adaptive falsification for autonomous scientific discovery.arXiv preprint arXiv:2603.29045, 2026

Peiran Li, Fangzhou Lin, Shuo Xing, Jiashuo Sun, Dylan Zhang, Siyuan Yang, Chaoqun Ni, and Zhengzhong Tu. Let the abyss stare back adaptive falsification for autonomous scientific discovery.arXiv preprint arXiv:2603.29045, 2026

work page arXiv 2026

[33] [33]

Bibagent: An agentic framework for traceable miscitation detection in scientific literature.arXiv preprint arXiv:2601.16993, 2026

Peiran Li, Fangzhou Lin, Shuo Xing, Xiang Zheng, Xi Hong, Siyuan Yang, Jiashuo Sun, Zhengzhong Tu, and Chaoqun Ni. Bibagent: An agentic framework for traceable miscitation detection in scientific literature.arXiv preprint arXiv:2601.16993, 2026

work page arXiv 2026

[34] [34]

Traversal-as-policy: Log-distilled gated behavior trees as externalized, verifiable policies for safe, robust, and efficient agents.arXiv preprint arXiv:2603.05517, 2026

Peiran Li, Jiashuo Sun, Fangzhou Lin, Shuo Xing, Tianfu Fu, Suofei Feng, Chaoqun Ni, and Zhengzhong Tu. Traversal-as-policy: Log-distilled gated behavior trees as externalized, verifiable policies for safe, robust, and efficient agents.arXiv preprint arXiv:2603.05517, 2026. 11

work page arXiv 2026

[35] [35]

Contrastive decoding: Open-ended text generation as optimization

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Lon...

2023

[36] [36]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024

2024

[37] [37]

Position: Human-centric ai requires a minimum viable level of human understanding.arXiv preprint arXiv:2602.00854, 2026

Fangzhou Lin, Qianwen Ge, Lingyu Xu, Peiran Li, Xiangbo Gao, Shuo Xing, Kazunori Yamada, Ziming Zhang, Haichong Zhang, and Zhengzhong Tu. Position: Human-centric ai requires a minimum viable level of human understanding.arXiv preprint arXiv:2602.00854, 2026

work page arXiv 2026

[38] [38]

AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian Inference

Fangzhou Lin, Peiran Li, Shuo Xing, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, and Zhengzhong Tu. Adaptfuse: Training-free sequential preference learning via externalized bayesian inference.arXiv preprint arXiv:2604.03925, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

Fangzhou Lin, Shuo Xing, Peiran Li, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, and Zhengzhong Tu. Caps: Cascaded adaptive pairwise selection for efficient parallel reasoning.arXiv preprint arXiv:2605.15513, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [40]

Cot-valve: Length-compressible chain-of-thought tuning, 2025

Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. Cot-valve: Length-compressible chain-of-thought tuning, 2025

2025

[41] [41]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. InThirty-seventh Conference on Neural Informati...

2023

[42] [42]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, et al. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

introducing-o3-and-o4-mini.OpenAI Blog, 2025

OpenAI. introducing-o3-and-o4-mini.OpenAI Blog, 2025

2025

[44] [44]

Plum: Prompt learning using metaheuristics

Rui Pan, Shuo Xing, Shizhe Diao, Wenhe Sun, Xiang Liu, KaShun Shum, Jipeng Zhang, Renjie Pi, and Tong Zhang. Plum: Prompt learning using metaheuristics. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 2177–2197, Bangkok, Thailand, August 2024. Association for Computationa...

2024

[45] [45]

Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning, 2024

Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning, 2024

2024

[46] [46]

Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning, 2025

Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, and Jing Shao. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning, 2025

2025

[47] [47]

Concise: Confidence-guided compression in step-by-step efficient reasoning.Proceedings of EMNLP, 2025

Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, et al. Concise: Confidence-guided compression in step-by-step efficient reasoning.Proceedings of EMNLP, 2025

2025

[48] [48]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

2025

[49] [49]

Qwq: Reflect deeply on the boundaries of the unknown, 2025

Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, 2025. 12

2025

[50] [50]

Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time com- pute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[51] [51]

Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms, 2025

Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms, 2025

2025

[52] [52]

Thinking by subtraction: Confidence-driven contrastive decoding for llm reasoning, 2026

Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin, Bang Yang, and Yuexian Zou. Thinking by subtraction: Confidence-driven contrastive decoding for llm reasoning, 2026

2026

[53] [53]

Kimi k2: Open agentic intelligence, 2026

Kimi Team. Kimi k2: Open agentic intelligence, 2026

2026

[54] [54]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023

2023

[55] [55]

Do large language model benchmarks test reliability?, 2025

Joshua Vendrow, Edward Vendrow, Sara Beery, and Aleksander Madry. Do large language model benchmarks test reliability?, 2025

2025

[56] [56]

Investigating gender bias in language models using causal mediation analysis

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associa...

2020

[57] [57]

Adapthink: Adaptive thinking preferences for reasoning language model, 2025

Xu Wan, Wei Wang, Wenyue Xu, Wotao Yin, Jie Song, and Mingyang Sun. Adapthink: Adaptive thinking preferences for reasoning language model, 2025

2025

[58] [58]

Self-consistency improves chain of thought reasoning in language models.International Conference on Learning Representations (ICLR), 2023

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.International Conference on Learning Representations (ICLR), 2023

2023

[59] [59]

R1-compress: Long chain-of-thought compression via chunk compression and search.arXiv preprint, 2025

Yibo Wang, Haotian Luo, Li Shen, et al. R1-compress: Long chain-of-thought compression via chunk compression and search.arXiv preprint, 2025

2025

[60] [60]

Thoughts are all over the place: On the underthinking of o1-like llms.arXiv preprint arXiv:2501.18585, 2025

Yue Wang et al. Thoughts are all over the place: On the underthinking of o1-like llms.arXiv preprint arXiv:2501.18585, 2025

work page arXiv 2025

[61] [61]

Reasoning-finetuning repurposes latent representations in base models, 2025

Jake Ward, Chuqiao Lin, Constantin Venhoff, and Neel Nanda. Reasoning-finetuning repurposes latent representations in base models, 2025

2025

[62] [62]

Chi, Quoc V Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

2022

[63] [63]

It’s not that simple

Guojun Wu. It’s not that simple. an analysis of simple test-time scaling, 2025

2025

[64] [64]

Tokenskip: Controllable chain-of-thought compression in llms.Proceedings of EMNLP, 2025

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms.Proceedings of EMNLP, 2025

2025

[65] [65]

Chain of draft: Thinking faster by writing less, 2025

Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less, 2025

2025

[66] [66]

A*-thought: Efficient reasoning via bidirectional compression for low-resource settings.arXiv preprint, 2025

Xiaoang Xu, Shuo Wang, Xu Han, et al. A*-thought: Efficient reasoning via bidirectional compression for low-resource settings.arXiv preprint, 2025

2025

[67] [67]

Dynamic early exit in reasoning models, 2025

Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning models, 2025

2025

[68] [68]

Wong, and Di Wang

Shu Yang, Junchao Wu, Xin Chen, Yunze Xiao, Xinyi Yang, Derek F. Wong, and Di Wang. Understanding aha moments: from external observations to internal mechanisms, 2025

2025

[69] [69]

Towards thinking-optimal scaling of test-time compute for llm reasoning, 2025

Wenkai Yang, Shuming Ma, Yankai Lin, and Furu Wei. Towards thinking-optimal scaling of test-time compute for llm reasoning, 2025. 13

2025

[70] [70]

Griffiths, Yuan Cao, and Karthik Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023

2023

[71] [71]

Understanding hyperbolic metric learning through hard negative sampling

Yun Yue, Fangzhou Lin, Guanyi Mou, and Ziming Zhang. Understanding hyperbolic metric learning through hard negative sampling. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1891–1903, 2024

1903

[72] [72]

Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities?, 2025

Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, and Xipeng Qiu. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities?, 2025

2025

[73] [73]

Gps: A probabilistic distributional similarity with gumbel priors for set-to-set matching

Ziming Zhang, Fangzhou Lin, Haotian Liu, Jose Morales, Haichong Zhang, Kazunori Yamada, Vijaya B Kolachalama, and Venkatesh Saligrama. Gps: A probabilistic distributional similarity with gumbel priors for set-to-set matching. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[74] [74]

Deep loss convexification for learning iterative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1501–1513, 2024

Ziming Zhang, Yuping Shao, Yiqing Zhang, Fangzhou Lin, Haichong Zhang, and Elke Runden- steiner. Deep loss convexification for learning iterative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1501–1513, 2024

2024

[75] [75]

the answer is

Jiachen Zhao, Yiyou Sun, Weiyan Shi, and Dawn Song. Can aha moments be fake? identifying true and decorative thinking steps in chain-of-thought, 2026. 14 A Complete Experimental Setup Models.We evaluate four open-source reasoning language models that span scales, backbones, and distillation pipelines:DeepSeek-R1-Distill-Qwen-7B,DeepSeek-R1-Distill-Qwen-14...

2026