Pith · machine review for the scientific record

arxiv: 2605.06908 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents


Pith reviewed 2026-05-11 00:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords adaptive test-time compute · LLM agents · gating signals · direction learning · counterfactual exploration · success-cost trade-off · rollout utility

The pith

Gating signals for extra LLM agent compute reverse their meaning across environments and models, so direction must be learned per setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents decide whether to spend extra test-time computation such as rollouts based on signals like uncertainty or difficulty. Standard methods assume these signals always point the same way: higher signal means extra compute is more likely to help. The paper demonstrates that the direction is unstable, with the same signal improving outcomes in one environment or model but harming them in another even when the task stays fixed. This occurs because signals can mark either states where alternatives are worth comparing or states where the current context makes rollouts unsuitable. DIAL learns the correct direction for each environment and backbone from counterfactual rollouts, producing a better balance of task success against compute cost than fixed-direction approaches.

Core claim

The paper shows that alignment between gating signals and rollout utility is unstable: the identical signal predicts performance gain in some (environment, backbone) combinations and performance loss in others. This instability stems from the distinction between compute need and compute suitability under a two-source model. Fixed-direction gates therefore risk selecting precisely the states where extra computation reduces final outcome quality. DIAL trains a sparse gate on labels from signal-agnostic counterfactual exploration to recover the utility direction of state features for each specific (environment, backbone) pair, delivering stronger success-cost trade-offs across six environments.
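The gate the abstract describes, an ℓ1-regularized logistic model over state features fit to counterfactual utility labels, can be sketched on synthetic data. Everything below is illustrative assumption, not the paper's configuration: the feature dimension, the ground-truth weights, and the hyperparameters are invented; only the recipe (binary help/harm labels, logistic fit, soft-thresholding for sparsity) follows the stated method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: 5 state features, of which only the first two carry
# a real utility direction (sparse ground truth with opposite signs).
n, d = 2000, 5
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.0, 0.0, 0.0])  # signs = per-setting directions
# Counterfactual label: 1 iff the extra rollout helped on this state (noisy).
p_help = 1 / (1 + np.exp(-(X @ true_w)))
y = (rng.random(n) < p_help).astype(float)

def fit_l1_logistic(X, y, lam=0.02, lr=0.1, steps=3000):
    """Proximal gradient descent on l1-regularized logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y)
        w -= lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

w = fit_l1_logistic(X, y)
# Signed weights: the sign is the learned utility direction of each feature,
# and near-zero weights are pruned features.
print(np.round(w, 2))
```

The point of the signed-weight readout is that feature selection and direction recovery happen in one fit, which is what lets the same gate flip direction across (environment, backbone) pairs when the labels say so.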

What carries the argument

DIAL, a sparse gate trained via counterfactual exploration to recover the per-setting utility direction of state features.

If this is right

  • Wrong-direction gates degrade performance by routing extra compute exactly to states where it reduces outcome quality.
  • Compute need and compute suitability are distinct sources, so any fixed-direction rule becomes unreliable when environments or backbones change.
  • Learning the direction from data per (environment, backbone) produces a stronger overall success-cost trade-off than fixed baselines.
  • The two-source model explains why uncertainty signals can indicate either helpful decision difficulty or unhelpful intervention contexts.
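The reversal the bullets describe is easy to reproduce in a toy simulation. This is not from the paper: the noise level, sample size, and the two environments are made up, and serve only to show one signal carrying opposite utility directions, as in the reported Spearman sign flips.

```python
import numpy as np

rng = np.random.default_rng(1)

def spearman(a, b):
    """Spearman rank correlation via rank transform (no ties expected here)."""
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra**2).sum() * (rb**2).sum()))

n = 500
signal = rng.random(n)  # e.g. entropy of the base policy at each state

# Env A: high signal marks decision-difficult states, so rollouts help.
utility_A = signal + 0.3 * rng.normal(size=n)
# Env B: high signal marks intervention-unsuitable states, so rollouts hurt.
utility_B = -signal + 0.3 * rng.normal(size=n)

print(spearman(signal, utility_A), spearman(signal, utility_B))
```

A fixed-direction gate calibrated on environment A would route extra compute to exactly the harmful states in environment B, which is the failure mode the two-source model names.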

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adaptive compute methods may require per-deployment calibration rather than one-size-fits-all rules if reversals prove common.
  • The counterfactual labeling approach could extend to other adaptive decisions such as retrieval or tool selection in agents.
  • Online variants of DIAL might update directions during operation without a separate exploration phase.
  • Hybrid gates that combine learned direction with other signals could improve robustness when exploration data is limited.

Load-bearing premise

Counterfactual exploration produces reliable labels for the true utility direction of state features without selection bias.

What would settle it

Measure actual rollout benefit or harm on held-out states for a new environment-backbone pair; if DIAL's learned directions disagree with these measured outcomes on more than a small fraction of states, the learned gate does not recover the correct utility direction.
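That test could be scored mechanically. A minimal audit sketch follows; both arrays are synthetic stand-ins (the 4% disagreement rate is injected for illustration, not a measured result), and `predicted` stands in for the signs the learned gate would assign to held-out states.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical audit of a new (environment, backbone) pair: compare the sign
# of rollout utility the learned gate predicts for each held-out state with
# the sign actually measured by paired rollouts on that state.
n = 1000
measured_u = rng.normal(size=n)           # measured R(trigger) - R(base)
predicted = np.sign(measured_u).copy()    # stand-in for the gate's predictions
flip = rng.random(n) < 0.04               # pretend the gate errs on ~4% of states
predicted[flip] = -predicted[flip]

disagree = float(np.mean(predicted != np.sign(measured_u)))
print(f"direction disagreement on held-out states: {disagree:.1%}")
# A disagreement rate well above a small threshold would count against DIAL.
```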

Figures

Figures reproduced from arXiv: 2605.06908 by Chuxu Zhang, Guilin Wang, Jiatan Huang, Xiaoguang Guo, Ziming Li.

Figure 1
Figure 1. The fixed-direction assumption fails on half of the environments: Spearman correlation ρ(entropy, U) across six environments on Qwen3-4B. Answering this requires understanding why direction reverses in the first place. Hidden in the fixed-direction assumption is a deeper conflation: uncertainty or difficulty measures whether a state may benefit from more compute, but not whether that compute will actuall… view at source ↗
Figure 2
Figure 2. The DIAL pipeline. (a) Explore: paired counterfactual rollouts via Bernoulli-ε yield raw data D_raw = {(state text, R(a_T) − R(a_B))}. (b) Reason: from each state text, a candidate feature pool ϕ_cand(s) combines universal features with LLM-proposed task-specific features, yielding D = {(ϕ(s), U)}. (c) Learn: fit an ℓ1-regularized logistic gate g(s); signed weights jointly select features and recover per-environme… view at source ↗
Figure 3
Figure 3. Gate complexity ablation on HotpotQA (Qwen3-4B). Gate capacity does not substitute for direction. A natural objection is that DIAL's gains come from gate simplicity rather than learned direction. With correct direction, a logistic gate, a 2-layer MLP, and a hidden-state probe (2560-d) all reach ≈95% SR on HotpotQA: capacity adds <1%… view at source ↗
Figure 4
Figure 4. P1: ρ(entropy, U) shifts from early (blue, step ≤ median) to late (red); bars show Spearman ρ. Plancraft omitted (|ρ| = 0.016). The empirical landscape (§3.2) shows direction varies but does not pin down the Two-Source Model (§3.3) as the cause. We test the model with three complementary experiments that each rule out a different class of alternative explanation: within-episode temporal dynamics (P1, obser… view at source ↗
Figure 5
Figure 5. Prompt template used in the LLM feature layer (Phase 2) to propose task-specific features. view at source ↗
Figure 6
Figure 6. Complete LLM-generated feature extraction function for WebShop (seed 42), the environ… view at source ↗
Figure 7
Figure 7. SR vs. cost (×base) across all six environments on Qwen3-4B. Shaded region: dominated by DIAL (lower SR and higher cost). Full numerical results across three backbones are in… view at source ↗
Figure 8
Figure 8. Multi-backbone signal direction heatmap. view at source ↗
Figure 9
Figure 9. Cross-backbone comparison: DIAL (red) vs. best fixed-direction baseline (blue) across 3… view at source ↗
Figure 10
Figure 10. Environment-adaptive trigger behavior. Panels ordered by rollout headroom. view at source ↗
Original abstract

Adaptive test-time compute for LLM agents aims to invoke extra computation only when it improves performance. Existing methods typically use confidence-, uncertainty-, or difficulty-based gates, assuming a fixed direction from the gating signal through compute need to the value of computation. This makes gating a utility-calibration problem: gating signals should align with whether extra computation improves the final outcome over the base policy. We show that this alignment is unstable: the same signal predicts rollout benefit in one setting and rollout harm in another, with reversals across environments and backbones even when the task is fixed. Wrong-direction gates can therefore worsen performance by precisely selecting harmful states. This reversal reflects a deeper distinction between compute need and compute suitability: a high uncertainty signal may indicate decision-difficult states where rollouts help compare alternatives, or intervention-unsuitable states where the current context does not support useful rollout-based improvement. Under this two-source model, fixed-direction gates are unreliable across heterogeneous settings. To address this, we propose DIAL (Direction-Informed Adaptive Learning), a sparse gate trained from signal-agnostic counterfactual exploration to learn the utility direction of state features per (environment, backbone). Across six environments and three backbones, DIAL yields a stronger overall success-cost trade-off than fixed-direction baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard gating signals (e.g., uncertainty or confidence) for adaptive test-time compute in LLM agents exhibit unstable alignment with the utility of extra computation: the same signal can predict rollout benefit in one setting and harm in another, with reversals across environments and backbones even for fixed tasks. This is explained via a two-source model distinguishing compute need from compute suitability. The authors propose DIAL, a sparse gate trained via signal-agnostic counterfactual exploration to learn per-(environment, backbone) utility directions from observed outcomes, and report that it achieves a stronger overall success-cost trade-off than fixed-direction baselines across six environments and three backbones.

Significance. If the empirical results hold and the counterfactual labels are unbiased, the work is significant for LLM agent and test-time scaling research: it identifies a fundamental instability in existing adaptive-compute assumptions and supplies a data-driven mechanism to learn directions rather than assuming fixed alignment. The grounding of direction learning in counterfactual rollouts is a methodological strength that avoids pure self-reference.

major comments (3)
  1. [Method (DIAL)] Method section (DIAL training procedure): the central claim that DIAL learns the 'true' utility direction per (env, backbone) from counterfactual exploration requires that the exploration procedure generates unbiased labels for marginal value of compute. No derivation or analysis shows the estimator remains unbiased under heterogeneous base-policy state visitation distributions; states the base policy already handles well are likely underrepresented, so observed reversals could be sampling artifacts rather than intrinsic signal instability.
  2. [Experiments] Experimental results (across six environments and three backbones): the abstract and results claim reversals and superior success-cost trade-offs, yet no quantitative details (e.g., sign-flip magnitudes, correlation values before/after DIAL, or ablation isolating the direction-learning component) are supplied to verify that gains derive from direction learning rather than other factors in the sparse gate.
  3. [Introduction / Two-source model] Two-source model (compute need vs. suitability): the model is invoked to explain why fixed-direction gates fail, but no formalization or testable prediction distinguishes the two sources in a way that would allow falsification of the instability claim independent of the learned gate.
minor comments (2)
  1. [Method] Notation for the sparse gate and counterfactual labels should be defined more explicitly with symbols to avoid ambiguity when comparing to fixed-direction baselines.
  2. [Abstract] The abstract would benefit from one concrete numerical example of a reversal (e.g., correlation sign change) to ground the instability claim before the method is introduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential significance of identifying instability in standard gating signals for adaptive test-time compute. We address each major comment below with clarifications and commitments to revision where appropriate. Our responses focus on substance and aim to strengthen the manuscript without overstating current content.

Point-by-point responses
  1. Referee: [Method (DIAL)] Method section (DIAL training procedure): the central claim that DIAL learns the 'true' utility direction per (env, backbone) from counterfactual exploration requires that the exploration procedure generates unbiased labels for marginal value of compute. No derivation or analysis shows the estimator remains unbiased under heterogeneous base-policy state visitation distributions; states the base policy already handles well are likely underrepresented, so observed reversals could be sampling artifacts rather than intrinsic signal instability.

    Authors: We acknowledge that the manuscript lacks a formal derivation establishing unbiasedness of the counterfactual estimator for arbitrary base-policy visitation distributions. The procedure samples states from trajectories generated by the base policy and directly observes marginal outcomes via paired rollouts with and without extra compute; this yields empirical utility labels conditioned on the encountered distribution. While underrepresentation of well-handled states is possible, it aligns with the deployment distribution the gate must handle. We will revise the method section to explicitly discuss this limitation, add a note on potential sampling effects, and include sensitivity checks (e.g., reweighting or additional uniform sampling) to assess robustness of the learned directions. revision: partial
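One of the sensitivity checks the rebuttal commits to, inverse-propensity reweighting of the biased state sample, could look like the sketch below. The visitation model (a sigmoid of one feature) and the effect size are invented for illustration; the check simply asks whether the estimated utility direction of a feature keeps its sign after reweighting.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: states the base policy handles well are undersampled.
# Reweight each sampled state by 1 / visitation probability and see whether
# the estimated utility direction of a feature changes sign.
n = 5000
feature = rng.normal(size=n)
utility = 0.8 * feature + rng.normal(size=n)   # true direction: positive
visit_p = 1 / (1 + np.exp(-feature))           # low-feature states visited less
keep = rng.random(n) < visit_p                 # biased state sample
w_is = 1 / visit_p[keep]                       # importance weights

def weighted_corr(x, y, w):
    """Weighted Pearson correlation."""
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    return cov / np.sqrt(np.average((x - mx) ** 2, weights=w)
                         * np.average((y - my) ** 2, weights=w))

naive = weighted_corr(feature[keep], utility[keep], np.ones(int(keep.sum())))
reweighted = weighted_corr(feature[keep], utility[keep], w_is)
# If the two estimates agree in sign, the learned direction is robust to
# this particular visitation bias; a sign flip would support the referee.
print(naive, reweighted)
```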

  2. Referee: [Experiments] Experimental results (across six environments and three backbones): the abstract and results claim reversals and superior success-cost trade-offs, yet no quantitative details (e.g., sign-flip magnitudes, correlation values before/after DIAL, or ablation isolating the direction-learning component) are supplied to verify that gains derive from direction learning rather than other factors in the sparse gate.

    Authors: We agree that additional quantitative breakdowns would strengthen verification that performance gains stem specifically from direction learning. The current manuscript emphasizes aggregate success-cost trade-offs across settings, but we will expand the experiments section to report: sign-flip frequencies and magnitudes in signal-utility correlations across (environment, backbone) pairs; pre- and post-DIAL correlation values; and an ablation that holds the sparse gate architecture fixed while comparing learned directions against fixed-direction variants. These additions will isolate the contribution of the direction-learning component. revision: yes

  3. Referee: [Introduction / Two-source model] Two-source model (compute need vs. suitability): the model is invoked to explain why fixed-direction gates fail, but no formalization or testable prediction distinguishes the two sources in a way that would allow falsification of the instability claim independent of the learned gate.

    Authors: The two-source model is presented as a conceptual distinction to interpret the observed reversals: compute need captures whether a state is decision-difficult, while suitability captures whether extra compute can productively improve outcomes given the current context. This framework directly predicts that fixed-direction assumptions will be unreliable across heterogeneous settings, a prediction tested via the empirical reversals and DIAL's relative gains. We will revise the introduction to state the model's core assumptions more explicitly and clarify how the counterfactual-based evaluation provides an independent test of instability (via performance of fixed versus learned gates) without circularity. revision: partial

Circularity Check

0 steps flagged

No significant circularity: direction learning grounded in external counterfactual data

full rationale

The paper defines DIAL as a sparse gate trained directly on outcomes from signal-agnostic counterfactual exploration to learn per-(env, backbone) utility directions. This training procedure uses observed rollout results as labels, making the learned gate an empirical fit to external data rather than a self-referential definition or renamed input. The instability claim rests on reported reversals across environments and backbones, presented as empirical observations rather than derived by construction from the two-source model. No equations, self-citations, or uniqueness theorems are invoked in the abstract or description to force the result; the method remains self-contained against the provided benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so full details on parameters and assumptions cannot be audited. The proposal rests on learning directions from counterfactual data and the two-source model of compute need versus suitability.

axioms (1)
  • domain assumption The two-source model (compute need vs. compute suitability) explains observed signal reversals.
    Invoked to justify why fixed-direction gates are unreliable across heterogeneous settings.
invented entities (1)
  • DIAL sparse gate no independent evidence
    purpose: Learns per-(environment, backbone) utility direction of state features from counterfactual exploration
    New method component introduced to address direction instability; no independent evidence outside the paper is described.

pith-pipeline@v0.9.0 · 5534 in / 1369 out tokens · 65000 ms · 2026-05-11T00:48:10.199502+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages

  1. [1]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023

  2. [2]

    Reflexion: language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In NeurIPS, 2023

  3. [3]

    Language agent tree search unifies reasoning, acting, and planning in language models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In ICML, 2024

  4. [4]

    Mass: Mathematical data selection via skill graphs for pretraining large language models

    Jiazheng Li, Lu Yu, Qing Cui, Zhiqiang Zhang, Jun Zhou, Yanfang Ye, and Chuxu Zhang. Mass: Mathematical data selection via skill graphs for pretraining large language models. In ICML, 2025

  5. [5]

    Graph is a substrate across data modalities

    Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, and Chuxu Zhang. Graph is a substrate across data modalities. CoRR, 2026

  6. [6]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023

  7. [7]

    Reasoning with language model is planning with world model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In EMNLP, 2023

  8. [8]

    Scaling LLM test-time compute optimally can be more effective than scaling model parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. CoRR, 2024

  9. [9]

    Why reasoning fails to plan: A planning-centric analysis of long-horizon decision making in LLM agents

    Zehong Wang, Fang Wu, Hongru Wang, Xiangru Tang, Bolian Li, Zhenfei Yin, Yijun Ma, Yiyang Li, Weixiang Sun, Xiusi Chen, and Yanfang Ye. Why reasoning fails to plan: A planning-centric analysis of long-horizon decision making in LLM agents. CoRR, 2026

  10. [10]

    Scaling test-time compute for LLM agents

    King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, and Wangchunshu Zhou. Scaling test-time compute for LLM agents. CoRR, 2025

  11. [11]

    Semantic exploration with adaptive gating for efficient problem solving with language models

    Sungjae Lee, Hyejin Park, Jaechang Kim, and Jungseul Ok. Semantic exploration with adaptive gating for efficient problem solving with language models. In ACL, 2025

  12. [12]

    Cats: Calibrated test-time scaling for efficient LLM reasoning

    Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, and Jiaxin Huang. Cats: Calibrated test-time scaling for efficient LLM reasoning. In ICLR, 2026

  13. [13]

    Corefine: Confidence-guided self-refinement for adaptive test-time compute

    Chen Jin, Ryutaro Tanno, Tom Diethe, and Philip Teare. Corefine: Confidence-guided self-refinement for adaptive test-time compute. CoRR, 2026

  14. [14]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. In EMNLP, 2025

  15. [15]

    Token-budget-aware LLM reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In ACL, 2025

  16. [16]

    Agentic test-time scaling for webagents

    Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. Agentic test-time scaling for webagents. CoRR, 2026

  17. [17]

    Diffadapt: Difficulty-adaptive reasoning for token-efficient LLM inference

    Xiang Liu, Xuming Hu, Xiaowen Chu, and Eunsol Choi. Diffadapt: Difficulty-adaptive reasoning for token-efficient LLM inference. CoRR, 2025

  18. [18]

    The LLM already knows: Estimating LLM-perceived question difficulty via hidden representations

    Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong, and Jing Shao. The LLM already knows: Estimating LLM-perceived question difficulty via hidden representations. In EMNLP, 2025

  19. [19]

    Adaptthink: Reasoning models can learn when to think

    Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. Adaptthink: Reasoning models can learn when to think. In EMNLP, 2025

  20. [20]

    Thinkless: LLM learns when to think

    Gongfan Fang, Xinyin Ma, and Xinchao Wang. Thinkless: LLM learns when to think. CoRR, 2025

  21. [21]

    The interpretation of interaction in contingency tables

    Edward H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, 1951

  22. [22]

    Learning to discover various Simpson's paradoxes

    Jingwei Wang, Jianshan He, Weidi Xu, Ruopeng Li, and Wei Chu. Learning to discover various Simpson's paradoxes. In KDD, 2023

  23. [23]

    Think just enough: Sequence-level entropy as a confidence signal for LLM reasoning

    Aman Sharma and Paras Chopra. Think just enough: Sequence-level entropy as a confidence signal for LLM reasoning. CoRR, 2025

  24. [24]

    Agentic uncertainty quantification

    Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang, Caiming Xiong, and Chien-Sheng Wu. Agentic uncertainty quantification. CoRR, 2026

  25. [25]

    L1: Controlling how long a reasoning model thinks with reinforcement learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. CoRR, 2025

  26. [26]

    Learning when to plan: Efficiently allocating test-time compute for LLM agents

    Davide Paglieri, Bartlomiej Cupial, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. Learning when to plan: Efficiently allocating test-time compute for LLM agents. CoRR, 2025

  27. [27]

    Training verifiers to solve math word problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, 2021

  28. [28]

    GPQA: A graduate-level Google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, 2023

  29. [29]

    Revisiting uncertainty estimation and calibration of large language models

    Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, and Chang Xu. Revisiting uncertainty estimation and calibration of large language models. CoRR, 2025

  30. [30]

    Do LLMs estimate uncertainty well in instruction-following?

    Juyeon Heo, Miao Xiong, Christina Heinze-Deml, and Jaya Narain. Do LLMs estimate uncertainty well in instruction-following? In ICLR, 2025

  31. [31]

    To believe or not to believe your LLM: iterative prompting for estimating epistemic uncertainty

    Yasin Abbasi-Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your LLM: iterative prompting for estimating epistemic uncertainty. In NeurIPS, 2024

  32. [32]

    Do the right thing: studies in limited rationality

    Stuart J. Russell and Eric Wefald. Do the right thing: studies in limited rationality. MIT Press, 1991

  33. [33]

    Rational metareasoning for large language models

    Nicolò De Sabbata, Theodore R. Sumers, and Thomas L. Griffiths. Rational metareasoning for large language models. CoRR, 2024

  34. [34]

    Qwen3 technical report

    Qwen Team. Qwen3 technical report. CoRR, 2025

  35. [35]

    The Llama 3 herd of models

    Llama Team. The Llama 3 herd of models. CoRR, 2024

  36. [36]

    Phi-3 technical report: A highly capable language model locally on your phone

    Microsoft. Phi-3 technical report: A highly capable language model locally on your phone. CoRR, 2024

  37. [37]

    Comment: Understanding Simpson's paradox

    Judea Pearl. Comment: Understanding Simpson's paradox. In Probabilistic and Causal Inference: The Works of Judea Pearl. 2022

  38. [38]

    Aleatory or epistemic? Does it matter?

    Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 2009

  39. [39]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018

  40. [40]

    Measuring coding challenge competence with APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Andy Zou, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In NeurIPS, 2021

  41. [41]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In NeurIPS, 2022

  42. [42]

    FEVER: a large-scale dataset for fact extraction and verification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and verification. In NAACL, 2018

  43. [43]

    TextWorldExpress: Simulating text games at one million steps per second

    Peter Jansen and Marc-Alexandre Côté. TextWorldExpress: Simulating text games at one million steps per second. In EACL, 2023

  44. [44]

    Plancraft: an evaluation dataset for planning with LLM agents

    Gautier Dagan, Frank Keller, and Alex Lascarides. Plancraft: an evaluation dataset for planning with LLM agents. CoRR, 2024

  45. [45]

    Agentic reinforced policy optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic reinforced policy optimization. CoRR, 2025

  46. [46]

    Adaptive computation time for recurrent neural networks

    Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, 2016

  47. [47]

    Branchynet: Fast inference via early exiting from deep neural networks

    Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. Branchynet: Fast inference via early exiting from deep neural networks. CoRR, 2017

  48. [48]

    Layerskip: Enabling early exit inference and self-speculative decoding

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, and Carole-Jean Wu. Layerskip: Enabling early exit inference and self-speculative decoding. In ACL, 2024

  49. [49]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017

  50. [50]

    Mixture-of-depths: Dynamically allocating compute in transformer-based language models

    David Raposo, Samuel Ritter, Blake A. Richards, Timothy P. Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. CoRR, 2024

  51. [51]

    Language models (mostly) know what they know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, et al. Language models (mostly) know what they know. CoRR, 2022

  52. [52]

    Reasoning about beliefs and actions under computational resource constraints

    Eric Horvitz. Reasoning about beliefs and actions under computational resource constraints. In Laveen N. Kanal, Tod S. Levitt, and John F. Lemmer, editors, UAI, 1987

  53. [53]

    Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources

    Falk Lieder and Thomas L. Griffiths. Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, 2020

  54. [54]

    Alphazero-like tree-search can guide large language model decoding and training

    Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. In ICML, 2024

  55. [55]

    Wider or deeper? Scaling LLM inference-time compute with adaptive branching tree search

    Kou Misaki, Yuichi Inoue, Yuki Imajuku, So Kuroki, Taishi Nakamura, and Takuya Akiba. Wider or deeper? Scaling LLM inference-time compute with adaptive branching tree search. CoRR, 2025

  56. [56]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In ICLR, 2023

  57. [57]

    Large language monkeys: Scaling inference compute with repeated sampling

    Bradley C. A. Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. CoRR, 2024

  58. [58]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In ICLR, 2024

  59. [59]

    Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In ACL, 2024

  60. [60]

    Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. CoRR, 2025

  61. [61]

    Tree search for language model agents

    Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language model agents. CoRR, 2024

  62. [62]

    Agent Q: advanced reasoning and learning for autonomous AI agents

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent Q: advanced reasoning and learning for autonomous AI agents. CoRR, 2024

  63. [63]

    Evolverouter: Co-evolving routing and prompt for multi-agent question answering

    Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, and Chuxu Zhang. Evolverouter: Co-evolving routing and prompt for multi-agent question answering. CoRR, 2026

  64. [64]

    Agentrouter: A knowledge-graph-guided LLM router for collaborative multi-agent question answering

    Zheyuan Zhang, Kaiwen Shi, Zhengqing Yuan, Zehong Wang, Tianyi Ma, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, and Yanfang Ye. Agentrouter: A knowledge-graph-guided LLM router for collaborative multi-agent question answering. CoRR, 2025

  65. [65]

    Autodata: A multi-agent system for open web data collection

    Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, et al. Autodata: A multi-agent system for open web data collection. CoRR, 2025