Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

Changyi Yang; Chenyang Zhao; Dhruv Pai; Jinman Zhao; Jin Pan; Julian McAuley; Mingyu Jin; Raymond Li; Shuming Hu; Wujiang Xu

arxiv: 2606.25354 · v2 · pith:S225HB2Qnew · submitted 2026-06-24 · 💻 cs.CL

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

Yutong Yin , Mingyu Jin , Jin Pan , Changyi Yang , Zijie Xia , Dhruv Pai , Shuming Hu , Zhen Zhang

show 7 more authors

Chenyang Zhao Jinman Zhao Wujiang Xu Raymond Li Xin Eric Wang Julian McAuley Zhaoran Wang

This is my paper

Pith reviewed 2026-07-01 06:31 UTC · model grok-4.3

classification 💻 cs.CL

keywords local branch routingtest-time scalinglanguage model reasoningreinforcement learningmathematical reasoningchain of thoughttoken-level branchinglookahead tree

0 comments

The pith

Local Branch Routing scales language-model reasoning at test time by routing short local lookahead branches on hidden states and training the router jointly with RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Local Branch Routing as a token-level method for test-time scaling that grows a small lookahead tree, runs the branches through the model, and lets a lightweight router pick the best depth-1 subtree using the hidden states after those branches. This keeps decisions local and discrete while giving the router evidence beyond the root next-token distribution. A prune-shift-grow process preserves branch identities so a tractable tree-trajectory likelihood can be defined, allowing end-to-end reinforcement learning with verifiable rewards under the same ratio principle used for discrete-token methods. On synthetic hierarchical-planning tasks the router demonstrably benefits from the post-candidate states. On mathematical reasoning benchmarks the full system raises both Pass@1 and Pass@32 over chain-of-thought, vanilla discrete RLVR, and other soft-token branching baselines.

Core claim

Local Branch Routing expands a small local lookahead tree at each token, forwards all sampled branches through the language model, and routes over their hidden states to commit to one depth-1 subtree. The resulting prune-shift-grow decoding keeps discrete branch identities intact and yields an explicit tree-trajectory likelihood in which newly grown nodes are counted on first sampling and router decisions receive explicit probabilities. This likelihood supports joint end-to-end reinforcement learning of the base model and router. Experiments confirm that post-candidate hidden states supply useful routing evidence on synthetic tasks and that the method improves Pass@1 and Pass@32 on mathemati

What carries the argument

The lightweight router that selects among depth-1 subtrees by examining the hidden states produced after candidate branches are forwarded through the model.

If this is right

Both Pass@1 and Pass@32 rise on mathematical reasoning benchmarks relative to discrete chain-of-thought and other RL baselines.
The tree-trajectory likelihood makes end-to-end reinforcement learning of base model and router possible under the same ratio principle as discrete-token RLVR.
On synthetic hierarchical-planning tasks the router demonstrably benefits from post-candidate hidden states rather than root next-token information alone.
The prune-shift-grow process keeps computation local while still permitting multi-step lookahead at each token without full solution-level search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the router remains small, the method could be stacked with larger base models or longer overall reasoning traces without proportional cost growth.
The same local-branching pattern might transfer to domains where next-token uncertainty is high but full search is impractical.
An ablation that freezes the base model after initial RL and trains only the router would isolate how much of the gain comes from the hidden-state routing signal itself.

Load-bearing premise

Post-candidate hidden states contain routing evidence beyond the root next-token distribution that a lightweight router can usefully exploit to pick effective depth-1 subtrees.

What would settle it

A controlled experiment in which the router is given only the root next-token distribution versus the full post-candidate hidden states and shows no accuracy gain on held-out math problems would falsify the claim that those states supply useful additional evidence.

Figures

Figures reproduced from arXiv: 2606.25354 by Changyi Yang, Chenyang Zhao, Dhruv Pai, Jinman Zhao, Jin Pan, Julian McAuley, Mingyu Jin, Raymond Li, Shuming Hu, Wujiang Xu, Xin Eric Wang, Yutong Yin, Zhaoran Wang, Zhen Zhang, Zijie Xia.

**Figure 1.** Figure 1: Local Branch Routing decoding pipeline. LBR maintains a rolling local tree of already-forwarded candidate continuations. At each step, the router uses hidden states from all nodes in the current tree to select one depth-1 subtree, commits its root token, prunes the other subtrees, shifts the selected subtree forward, and grows one new layer to restore depth L. The top row shows the main experimental settin… view at source ↗

**Figure 2.** Figure 2: Set-attention router. The router first encodes each depth-1 candidate subtree independently into a vector gt,k. For the main L = 1 setting, a candidate subtree consists of a single forwarded token, so gt,k is computed from its post-token hidden state. For L = 2, the subtree encoder summarizes the hidden states of the candidate root and its local continuations. The resulting candidate vectors are then pass… view at source ↗

**Figure 3.** Figure 3: Radix-translated graph reachability and decoding behavior. Left: a concept-level reachability [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Synthetic hierarchical-planning results. Left: LBR achieves the highest target accuracy [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Router ablation. The full contrastive router outperforms the independent router, showing that comparing sibling branches improves routing [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 7.** Figure 7: Concept-identity probe after the first generated graph node. Discrete CoT and LBR preserve [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

read the original abstract

Test-time scaling improves language-model reasoning, but existing approaches often face a difficult trade-off: long chain-of-thought sampling remains single-threaded, while sentence- or solution-level search can be computationally expensive and hard to train end-to-end. We introduce Local Branch Routing (LBR), a token-level test-time scaling framework that expands a small local lookahead tree, forwards all sampled branches through the language model, and uses a lightweight router to select the depth-1 subtree to commit. By routing over the hidden states of candidate local futures, LBR allows each token decision to use evidence beyond the root next-token distribution while avoiding full solution-level search. The resulting prune-shift-grow decoding process preserves discrete branch identities and defines a tractable tree-trajectory likelihood: newly grown nodes are counted when first sampled, and router decisions are assigned explicit probabilities. This enables end-to-end reinforcement learning with verifiable rewards, jointly optimizing the base model and router under the same likelihood-ratio principle as discrete-token RLVR. On synthetic hierarchical-planning tasks, LBR shows that post-candidate hidden states provide useful routing evidence. On mathematical reasoning benchmarks, LBR improves both Pass@1 and Pass@32 over discrete chain-of-thought, vanilla discrete-token RLVR, and RL-compatible soft-token branching baselines. These results suggest that lightweight local branching offers an efficient, trainable, and discrete form of language-model test-time scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LBR gives a workable token-level branching scheme with explicit likelihood for joint RL, but the reported gains rest on unshown experimental controls.

read the letter

The core new piece is the local branch routing setup: at each token the model grows a small depth-1 tree, routes over the hidden states of those candidates, and commits to one subtree. They define a tree-trajectory likelihood that counts newly sampled nodes and assigns probabilities to router choices, which lets them run end-to-end RL on both the base model and the router under the same likelihood-ratio objective used in discrete RLVR.

That construction is the main technical step forward. It keeps branch identities discrete, avoids the credit-assignment problems that usually block RL on soft branching, and stays cheaper than sentence- or solution-level search. The abstract also reports that synthetic hierarchical-planning tasks confirm the router uses post-candidate states usefully, and that the method lifts both Pass@1 and Pass@32 on math benchmarks over chain-of-thought, vanilla RLVR, and RL-compatible soft-token baselines.

The main soft spot is that none of the experimental details are visible here. We do not see how many seeds were run, how the router was sized, what the exact data splits were, or whether the Pass@32 numbers reflect the same compute budget as the baselines. Without those, it is difficult to judge whether the router is genuinely extracting extra signal or whether the gains could be explained by other factors. The central assumption—that hidden states after the candidate tokens carry routing value beyond the root distribution—remains plausible but untested in the provided text.

The work is aimed at researchers who already care about inference-time scaling and want something between single-thread sampling and full tree search. The framework looks coherent on its own terms and the likelihood construction removes an obvious obstacle to training, so the paper is worth sending to referees who can check the experiments and derivations in detail.

Referee Report

1 major / 2 minor

Summary. The paper introduces Local Branch Routing (LBR), a token-level test-time scaling framework for language models. It expands small local lookahead trees at each token, forwards branches through the model, and employs a lightweight router over post-candidate hidden states to select the depth-1 subtree to commit to. The prune-shift-grow process preserves discrete branch identities and defines a tractable tree-trajectory likelihood that supports end-to-end RL with verifiable rewards, jointly optimizing the base model and router. Experiments on synthetic hierarchical-planning tasks validate the utility of post-candidate hidden states for routing, while mathematical reasoning benchmarks show gains in both Pass@1 and Pass@32 over discrete CoT, vanilla discrete-token RLVR, and RL-compatible soft-token branching baselines.

Significance. If the results hold, LBR offers a practical middle ground between single-threaded sampling and expensive solution-level search by enabling efficient, trainable, discrete test-time scaling with end-to-end RL. The explicit construction of the tree-trajectory likelihood under the likelihood-ratio principle is a strength, as is the demonstration that routing over local futures can improve both single-sample and multi-sample performance without full-tree search.

major comments (1)

[Experiments on mathematical reasoning benchmarks] The central empirical claim on math benchmarks depends on the router successfully exploiting post-candidate hidden states beyond the root next-token distribution. While synthetic tasks are said to demonstrate this utility, the manuscript should include a direct ablation (e.g., router input variants) showing that the hidden-state evidence is load-bearing for the reported Pass@1/Pass@32 gains rather than an artifact of the local tree expansion alone.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a concise comparison table of computational cost (tokens evaluated per decision) versus the baselines to make the efficiency claim more immediate.
[Methods] Notation for the tree-trajectory likelihood (newly grown nodes counted on first sample, router decisions assigned explicit probabilities) should be formalized with an equation in the methods section for reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the major comment below.

read point-by-point responses

Referee: The central empirical claim on math benchmarks depends on the router successfully exploiting post-candidate hidden states beyond the root next-token distribution. While synthetic tasks are said to demonstrate this utility, the manuscript should include a direct ablation (e.g., router input variants) showing that the hidden-state evidence is load-bearing for the reported Pass@1/Pass@32 gains rather than an artifact of the local tree expansion alone.

Authors: We agree that a direct ablation on the math benchmarks would strengthen the central claim by isolating the contribution of post-candidate hidden states. We will add an ablation comparing router input variants (root next-token distribution only versus full post-candidate hidden states) on the mathematical reasoning tasks and report the resulting Pass@1/Pass@32 differences to confirm that the hidden-state evidence is load-bearing. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines a tree-trajectory likelihood by counting newly sampled nodes and assigning explicit probabilities to router decisions; this construction is presented as enabling end-to-end RL under the same likelihood-ratio principle as discrete-token RLVR, without reducing to fitted router parameters or prior results by definition. The central empirical claims (Pass@1/Pass@32 gains on math benchmarks) are downstream consequences of the routing mechanism exploiting post-candidate hidden states, validated separately on synthetic tasks. No self-citation load-bearing steps, fitted-input predictions, or ansatz smuggling appear in the derivation chain. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; full text required for ledger construction.

pith-pipeline@v0.9.1-grok · 5829 in / 1127 out tokens · 30468 ms · 2026-07-01T06:31:08.692161+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 27 canonical work pages · 10 internal anchors

[1]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

The internal state of an LLM knows when it’s lying

Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore, December 2023. As- sociation for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.68. URL https://aclantholog...

work page doi:10.18653/v1/2023.findings-emnlp.68 2023
[3]

Probing classifiers: Promises, shortcomings, and advances.Computa- tional Linguistics, 48(1):207–219, March 2022

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computa- tional Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli_a_00422. URL https: //aclanthology.org/2022.cl-1.7/

work page internal anchor Pith review doi:10.1162/coli_a_00422 2022
[4]

Soft tokens, hard truths.arXiv preprint arXiv:2509.19170, 2025

Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens, hard truths.arXiv preprint arXiv:2509.19170, 2025. 12

work page arXiv 2025
[5]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

work page arXiv 2024
[7]

Scaling speculative decoding with lookahead reasoning.arXiv preprint arXiv:2506.19830, 2025

Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, and Hao Zhang. Scaling speculative decoding with lookahead reasoning.arXiv preprint arXiv:2506.19830, 2025

work page arXiv 2025
[8]

Continuous chain of thought enables parallel exploration and reasoning.arXiv preprint arXiv:2505.23648, 2025

Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and reasoning.arXiv preprint arXiv:2505.23648, 2025

work page arXiv 2025
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Reasoning with language model is planning with world model

Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, 2023

2023
[11]

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Designing and interpreting probes with control tasks

John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, Chi...

work page doi:10.18653/v1/d19-1275 2019
[13]

Treerl: Llm reinforcement learning with on-policy tree search

Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. Treerl: Llm reinforcement learning with on-policy tree search. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12355–12369, 2025

2025
[14]

Wider or deeper? scaling llm inference-time compute with adaptive branching tree search.arXiv preprint arXiv:2503.04412, 2025

Yuichi Inoue, Kou Misaki, Yuki Imajuku, So Kuroki, Taishi Nakamura, and Takuya Akiba. Wider or deeper? scaling llm inference-time compute with adaptive branching tree search.arXiv preprint arXiv:2503.04412, 2025

work page arXiv 2025
[15]

Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, Fan Yang, Mengnan Du, and Yongfeng Zhang. Exploring concept depth: How large language models acquire knowledge and concept at different layers? In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio,...

2025
[16]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023. 13

2023
[17]

Policy guided tree search for enhanced llm reasoning.arXiv preprint arXiv:2502.06813, 2025

Yang Li. Policy guided tree search for enhanced llm reasoning.arXiv preprint arXiv:2502.06813, 2025

work page arXiv 2025
[18]

Treepo: Bridging the gap of policy optimiza- tion and efficacy and inference efficiency with heuristic tree-based modeling.arXiv preprint arXiv:2508.17445, 2025

Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, et al. Treepo: Bridging the gap of policy optimiza- tion and efficacy and inference efficiency with heuristic tree-based modeling.arXiv preprint arXiv:2508.17445, 2025

work page arXiv 2025
[19]

Reward-guided speculative decoding for efficient llm reasoning.arXiv preprint arXiv:2501.19324, 2025

Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient llm reasoning.arXiv preprint arXiv:2501.19324, 2025

work page arXiv 2025
[20]

Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpass- ing o1-preview with a 1.5b model by scaling rl.https://pretty-radio-b75.notion.site/ DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2 ,
[21]

Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification.arXiv preprint arXiv:2305.09781, 2023

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification.arXiv preprint arXiv:2305.09781, 2023

work page arXiv 2023
[22]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025

2025
[23]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Multiplex thinking: Reasoning via token-wise branch- and-merge.arXiv preprint arXiv:2601.08808,

Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, and Jiatao Gu. Multiplex thinking: Reasoning via token-wise branch-and-merge.arXiv preprint arXiv:2601.08808, 2026

work page arXiv 2026
[26]

Efficient reasoning for llms through speculative chain-of-thought.arXiv preprint arXiv:2504.19095, 2025

Jikai Wang, Juntao Li, Jianye Hou, Bowen Yan, Lijun Wu, and Min Zhang. Efficient reasoning for llms through speculative chain-of-thought.arXiv preprint arXiv:2504.19095, 2025

work page arXiv 2025
[27]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

2022
[29]

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, and Yejin Choi. Deepsearch: Overcome the bottleneck of reinforcement learning with verifiable rewards via monte carlo tree search.arXiv preprint arXiv:2509.25454, 2025. 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Llms are single-threaded reasoners: Demystifying the working mechanism of soft thinking.arXiv preprint arXiv:2508.03440, 2025

Junhong Wu, Jinliang Lu, Zixuan Ren, Gangqiang Hu, Zhi Wu, Dai Dai, and Hua Wu. Llms are single-threaded reasoners: Demystifying the working mechanism of soft thinking.arXiv preprint arXiv:2508.03440, 2025

work page arXiv 2025
[31]

arXiv preprint arXiv:2405.00451 , year=

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning. arXiv preprint arXiv:2405.00451, 2024

work page arXiv 2024
[32]

Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

2023
[33]

Soft thinking: Unlocking the reasoning potential of llms in continuous concept space.arXiv preprint arXiv:2505.15778, 2025

Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space.arXiv preprint arXiv:2505.15778, 2025

work page arXiv 2025
[34]

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models.arXiv preprint arXiv:2310.04406, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Grow local tree

Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought.arXiv preprint arXiv:2505.12514, 2025. 15 A Complete Decoding Framework A.1 Rolling Local Lookahead Tree Let x<t denote the committed prefix before decoding positiont. Standard autoregress...

work page arXiv 2025

[1] [1]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

The internal state of an LLM knows when it’s lying

Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore, December 2023. As- sociation for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.68. URL https://aclantholog...

work page doi:10.18653/v1/2023.findings-emnlp.68 2023

[3] [3]

Probing classifiers: Promises, shortcomings, and advances.Computa- tional Linguistics, 48(1):207–219, March 2022

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computa- tional Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli_a_00422. URL https: //aclanthology.org/2022.cl-1.7/

work page internal anchor Pith review doi:10.1162/coli_a_00422 2022

[4] [4]

Soft tokens, hard truths.arXiv preprint arXiv:2509.19170, 2025

Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens, hard truths.arXiv preprint arXiv:2509.19170, 2025. 12

work page arXiv 2025

[5] [5]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

work page arXiv 2024

[7] [7]

Scaling speculative decoding with lookahead reasoning.arXiv preprint arXiv:2506.19830, 2025

Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, and Hao Zhang. Scaling speculative decoding with lookahead reasoning.arXiv preprint arXiv:2506.19830, 2025

work page arXiv 2025

[8] [8]

Continuous chain of thought enables parallel exploration and reasoning.arXiv preprint arXiv:2505.23648, 2025

Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and reasoning.arXiv preprint arXiv:2505.23648, 2025

work page arXiv 2025

[9] [9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Reasoning with language model is planning with world model

Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, 2023

2023

[11] [11]

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Designing and interpreting probes with control tasks

John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, Chi...

work page doi:10.18653/v1/d19-1275 2019

[13] [13]

Treerl: Llm reinforcement learning with on-policy tree search

Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. Treerl: Llm reinforcement learning with on-policy tree search. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12355–12369, 2025

2025

[14] [14]

Wider or deeper? scaling llm inference-time compute with adaptive branching tree search.arXiv preprint arXiv:2503.04412, 2025

Yuichi Inoue, Kou Misaki, Yuki Imajuku, So Kuroki, Taishi Nakamura, and Takuya Akiba. Wider or deeper? scaling llm inference-time compute with adaptive branching tree search.arXiv preprint arXiv:2503.04412, 2025

work page arXiv 2025

[15] [15]

Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, Fan Yang, Mengnan Du, and Yongfeng Zhang. Exploring concept depth: How large language models acquire knowledge and concept at different layers? In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio,...

2025

[16] [16]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023. 13

2023

[17] [17]

Policy guided tree search for enhanced llm reasoning.arXiv preprint arXiv:2502.06813, 2025

Yang Li. Policy guided tree search for enhanced llm reasoning.arXiv preprint arXiv:2502.06813, 2025

work page arXiv 2025

[18] [18]

Treepo: Bridging the gap of policy optimiza- tion and efficacy and inference efficiency with heuristic tree-based modeling.arXiv preprint arXiv:2508.17445, 2025

Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, et al. Treepo: Bridging the gap of policy optimiza- tion and efficacy and inference efficiency with heuristic tree-based modeling.arXiv preprint arXiv:2508.17445, 2025

work page arXiv 2025

[19] [19]

Reward-guided speculative decoding for efficient llm reasoning.arXiv preprint arXiv:2501.19324, 2025

Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient llm reasoning.arXiv preprint arXiv:2501.19324, 2025

work page arXiv 2025

[20] [20]

Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpass- ing o1-preview with a 1.5b model by scaling rl.https://pretty-radio-b75.notion.site/ DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2 ,

[21] [21]

Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification.arXiv preprint arXiv:2305.09781, 2023

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification.arXiv preprint arXiv:2305.09781, 2023

work page arXiv 2023

[22] [22]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025

2025

[23] [23]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Multiplex thinking: Reasoning via token-wise branch- and-merge.arXiv preprint arXiv:2601.08808,

Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, and Jiatao Gu. Multiplex thinking: Reasoning via token-wise branch-and-merge.arXiv preprint arXiv:2601.08808, 2026

work page arXiv 2026

[26] [26]

Efficient reasoning for llms through speculative chain-of-thought.arXiv preprint arXiv:2504.19095, 2025

Jikai Wang, Juntao Li, Jianye Hou, Bowen Yan, Lijun Wu, and Min Zhang. Efficient reasoning for llms through speculative chain-of-thought.arXiv preprint arXiv:2504.19095, 2025

work page arXiv 2025

[27] [27]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[28] [28]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

2022

[29] [29]

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, and Yejin Choi. Deepsearch: Overcome the bottleneck of reinforcement learning with verifiable rewards via monte carlo tree search.arXiv preprint arXiv:2509.25454, 2025. 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Llms are single-threaded reasoners: Demystifying the working mechanism of soft thinking.arXiv preprint arXiv:2508.03440, 2025

Junhong Wu, Jinliang Lu, Zixuan Ren, Gangqiang Hu, Zhi Wu, Dai Dai, and Hua Wu. Llms are single-threaded reasoners: Demystifying the working mechanism of soft thinking.arXiv preprint arXiv:2508.03440, 2025

work page arXiv 2025

[31] [31]

arXiv preprint arXiv:2405.00451 , year=

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning. arXiv preprint arXiv:2405.00451, 2024

work page arXiv 2024

[32] [32]

Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

2023

[33] [33]

Soft thinking: Unlocking the reasoning potential of llms in continuous concept space.arXiv preprint arXiv:2505.15778, 2025

Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space.arXiv preprint arXiv:2505.15778, 2025

work page arXiv 2025

[34] [34]

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models.arXiv preprint arXiv:2310.04406, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Grow local tree

Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought.arXiv preprint arXiv:2505.12514, 2025. 15 A Complete Decoding Framework A.1 Rolling Local Lookahead Tree Let x<t denote the committed prefix before decoding positiont. Standard autoregress...

work page arXiv 2025