Pith · machine review for the scientific record

arxiv: 2605.06908 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents


Pith reviewed 2026-05-11 00:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords adaptive test-time compute · LLM agents · gating signals · direction learning · counterfactual exploration · success-cost trade-off · rollout utility

The pith

Gating signals for extra LLM agent compute reverse their meaning across environments and models, so direction must be learned per setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents decide whether to spend extra test-time computation such as rollouts based on signals like uncertainty or difficulty. Standard methods assume these signals always point the same way: higher signal means extra compute is more likely to help. The paper demonstrates that the direction is unstable, with the same signal improving outcomes in one environment or model but harming them in another even when the task stays fixed. This occurs because signals can mark either states where alternatives are worth comparing or states where the current context makes rollouts unsuitable. DIAL learns the correct direction for each environment and backbone from counterfactual rollouts, producing a better balance of task success against compute cost than fixed-direction approaches.

Core claim

The paper shows that alignment between gating signals and rollout utility is unstable: the identical signal predicts performance gain in some (environment, backbone) combinations and performance loss in others. This instability stems from the distinction between compute need and compute suitability under a two-source model. Fixed-direction gates therefore risk selecting precisely the states where extra computation reduces final outcome quality. DIAL trains a sparse gate on labels from signal-agnostic counterfactual exploration to recover the utility direction of state features for each specific (environment, backbone) pair, delivering stronger success-cost trade-offs across six environments.
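The gate the abstract describes, an ℓ1-regularized logistic model over state features fit to counterfactual utility labels, can be sketched on synthetic data. Everything below is illustrative assumption, not the paper's configuration: the feature dimension, the ground-truth weights, and the hyperparameters are invented; only the recipe (binary help/harm labels, logistic fit, soft-thresholding for sparsity) follows the stated method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: 5 state features, of which only the first two carry
# a real utility direction (sparse ground truth with opposite signs).
n, d = 2000, 5
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.0, 0.0, 0.0])  # signs = per-setting directions
# Counterfactual label: 1 iff the extra rollout helped on this state (noisy).
p_help = 1 / (1 + np.exp(-(X @ true_w)))
y = (rng.random(n) < p_help).astype(float)

def fit_l1_logistic(X, y, lam=0.02, lr=0.1, steps=3000):
    """Proximal gradient descent on l1-regularized logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y)
        w -= lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

w = fit_l1_logistic(X, y)
# Signed weights: the sign is the learned utility direction of each feature,
# and near-zero weights are pruned features.
print(np.round(w, 2))
```

The point of the signed-weight readout is that feature selection and direction recovery happen in one fit, which is what lets the same gate flip direction across (environment, backbone) pairs when the labels say so.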

What carries the argument

DIAL, a sparse gate trained via counterfactual exploration to recover the per-setting utility direction of state features.

If this is right

  • Wrong-direction gates degrade performance by routing extra compute exactly to states where it reduces outcome quality.
  • Compute need and compute suitability are distinct sources, so any fixed-direction rule becomes unreliable when environments or backbones change.
  • Learning the direction from data per (environment, backbone) produces a stronger overall success-cost trade-off than fixed baselines.
  • The two-source model explains why uncertainty signals can indicate either helpful decision difficulty or unhelpful intervention contexts.
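The reversal the bullets describe is easy to reproduce in a toy simulation. This is not from the paper: the noise level, sample size, and the two environments are made up, and serve only to show one signal carrying opposite utility directions, as in the reported Spearman sign flips.

```python
import numpy as np

rng = np.random.default_rng(1)

def spearman(a, b):
    """Spearman rank correlation via rank transform (no ties expected here)."""
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra**2).sum() * (rb**2).sum()))

n = 500
signal = rng.random(n)  # e.g. entropy of the base policy at each state

# Env A: high signal marks decision-difficult states, so rollouts help.
utility_A = signal + 0.3 * rng.normal(size=n)
# Env B: high signal marks intervention-unsuitable states, so rollouts hurt.
utility_B = -signal + 0.3 * rng.normal(size=n)

print(spearman(signal, utility_A), spearman(signal, utility_B))
```

A fixed-direction gate calibrated on environment A would route extra compute to exactly the harmful states in environment B, which is the failure mode the two-source model names.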

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adaptive compute methods may require per-deployment calibration rather than one-size-fits-all rules if reversals prove common.
  • The counterfactual labeling approach could extend to other adaptive decisions such as retrieval or tool selection in agents.
  • Online variants of DIAL might update directions during operation without a separate exploration phase.
  • Hybrid gates that combine learned direction with other signals could improve robustness when exploration data is limited.

Load-bearing premise

Counterfactual exploration produces reliable labels for the true utility direction of state features without selection bias.

What would settle it

Measure actual rollout benefit or harm on held-out states for a new environment-backbone pair; if DIAL's learned directions disagree with these measured outcomes on more than a small fraction of states, the learned gate does not recover the correct utility direction.
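That test could be scored mechanically. A minimal audit sketch follows; both arrays are synthetic stand-ins (the 4% disagreement rate is injected for illustration, not a measured result), and `predicted` stands in for the signs the learned gate would assign to held-out states.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical audit of a new (environment, backbone) pair: compare the sign
# of rollout utility the learned gate predicts for each held-out state with
# the sign actually measured by paired rollouts on that state.
n = 1000
measured_u = rng.normal(size=n)           # measured R(trigger) - R(base)
predicted = np.sign(measured_u).copy()    # stand-in for the gate's predictions
flip = rng.random(n) < 0.04               # pretend the gate errs on ~4% of states
predicted[flip] = -predicted[flip]

disagree = float(np.mean(predicted != np.sign(measured_u)))
print(f"direction disagreement on held-out states: {disagree:.1%}")
# A disagreement rate well above a small threshold would count against DIAL.
```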

Figures

Figures reproduced from arXiv: 2605.06908 by Chuxu Zhang, Guilin Wang, Jiatan Huang, Xiaoguang Guo, Ziming Li.

Figure 1
Figure 1. The fixed-direction assumption fails on half of the environments: Spearman correlation ρ(entropy, U) across six environments on Qwen3-4B. Answering this requires understanding why direction reverses in the first place. Hidden in the fixed-direction assumption is a deeper conflation: uncertainty or difficulty measures whether a state may benefit from more compute, but not whether that compute will actuall… view at source ↗
Figure 2
Figure 2. The DIAL pipeline. (a) Explore: paired counterfactual rollouts via Bernoulli-ε yield raw data D_raw = {(state text, R(a_T) − R(a_B))}. (b) Reason: from each state text, a candidate feature pool ϕ_cand(s) combines universal features with LLM-proposed task-specific features, yielding D = {(ϕ(s), U)}. (c) Learn: fit an ℓ1-regularized logistic gate g(s); signed weights jointly select features and recover per-environme… view at source ↗
Figure 3
Figure 3. Gate complexity ablation on HotpotQA (Qwen3-4B). Gate capacity does not substitute for direction. A natural objection is that DIAL's gains come from gate simplicity rather than learned direction. With correct direction, a logistic gate, a 2-layer MLP, and a hidden-state probe (2560-d) all reach ≈95% SR on HotpotQA: capacity adds <1%… view at source ↗
Figure 4
Figure 4. P1: ρ(entropy, U) shifts from early (blue, step ≤ median) to late (red); bars show Spearman ρ. Plancraft omitted (|ρ| = 0.016). The empirical landscape (§3.2) shows direction varies but does not pin down the Two-Source Model (§3.3) as the cause. We test the model with three complementary experiments that each rule out a different class of alternative explanation: within-episode temporal dynamics (P1, obser… view at source ↗
Figure 5
Figure 5. Prompt template used in the LLM feature layer (Phase 2) to propose task-specific features. view at source ↗
Figure 6
Figure 6. Complete LLM-generated feature extraction function for WebShop (seed 42), the environ… view at source ↗
Figure 7
Figure 7. SR vs. cost (×base) across all six environments on Qwen3-4B. Shaded region: dominated by DIAL (lower SR and higher cost). Full numerical results across three backbones are in… view at source ↗
Figure 8
Figure 8. Multi-backbone signal direction heatmap. view at source ↗
Figure 9
Figure 9. Cross-backbone comparison: DIAL (red) vs. best fixed-direction baseline (blue) across 3… view at source ↗
Figure 10
Figure 10. Environment-adaptive trigger behavior. Panels ordered by rollout headroom. view at source ↗
Original abstract

Adaptive test-time compute for LLM agents aims to invoke extra computation only when it improves performance. Existing methods typically use confidence-, uncertainty-, or difficulty-based gates, assuming a fixed direction from the gating signal through compute need to the value of computation. This makes gating a utility-calibration problem: gating signals should align with whether extra computation improves the final outcome over the base policy. We show that this alignment is unstable: the same signal predicts rollout benefit in one setting and rollout harm in another, with reversals across environments and backbones even when the task is fixed. Wrong-direction gates can therefore worsen performance by precisely selecting harmful states. This reversal reflects a deeper distinction between compute need and compute suitability: a high uncertainty signal may indicate decision-difficult states where rollouts help compare alternatives, or intervention-unsuitable states where the current context does not support useful rollout-based improvement. Under this two-source model, fixed-direction gates are unreliable across heterogeneous settings. To address this, we propose DIAL (Direction-Informed Adaptive Learning), a sparse gate trained from signal-agnostic counterfactual exploration to learn the utility direction of state features per (environment, backbone). Across six environments and three backbones, DIAL yields a stronger overall success-cost trade-off than fixed-direction baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard gating signals (e.g., uncertainty or confidence) for adaptive test-time compute in LLM agents exhibit unstable alignment with the utility of extra computation: the same signal can predict rollout benefit in one setting and harm in another, with reversals across environments and backbones even for fixed tasks. This is explained via a two-source model distinguishing compute need from compute suitability. The authors propose DIAL, a sparse gate trained via signal-agnostic counterfactual exploration to learn per-(environment, backbone) utility directions from observed outcomes, and report that it achieves a stronger overall success-cost trade-off than fixed-direction baselines across six environments and three backbones.

Significance. If the empirical results hold and the counterfactual labels are unbiased, the work is significant for LLM agent and test-time scaling research: it identifies a fundamental instability in existing adaptive-compute assumptions and supplies a data-driven mechanism to learn directions rather than assuming fixed alignment. The grounding of direction learning in counterfactual rollouts is a methodological strength that avoids pure self-reference.

major comments (3)
  1. [Method (DIAL)] Method section (DIAL training procedure): the central claim that DIAL learns the 'true' utility direction per (env, backbone) from counterfactual exploration requires that the exploration procedure generates unbiased labels for marginal value of compute. No derivation or analysis shows the estimator remains unbiased under heterogeneous base-policy state visitation distributions; states the base policy already handles well are likely underrepresented, so observed reversals could be sampling artifacts rather than intrinsic signal instability.
  2. [Experiments] Experimental results (across six environments and three backbones): the abstract and results claim reversals and superior success-cost trade-offs, yet no quantitative details (e.g., sign-flip magnitudes, correlation values before/after DIAL, or ablation isolating the direction-learning component) are supplied to verify that gains derive from direction learning rather than other factors in the sparse gate.
  3. [Introduction / Two-source model] Two-source model (compute need vs. suitability): the model is invoked to explain why fixed-direction gates fail, but no formalization or testable prediction distinguishes the two sources in a way that would allow falsification of the instability claim independent of the learned gate.
minor comments (2)
  1. [Method] Notation for the sparse gate and counterfactual labels should be defined more explicitly with symbols to avoid ambiguity when comparing to fixed-direction baselines.
  2. [Abstract] The abstract would benefit from one concrete numerical example of a reversal (e.g., correlation sign change) to ground the instability claim before the method is introduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential significance of identifying instability in standard gating signals for adaptive test-time compute. We address each major comment below with clarifications and commitments to revision where appropriate. Our responses focus on substance and aim to strengthen the manuscript without overstating current content.

Point-by-point responses
  1. Referee: [Method (DIAL)] Method section (DIAL training procedure): the central claim that DIAL learns the 'true' utility direction per (env, backbone) from counterfactual exploration requires that the exploration procedure generates unbiased labels for marginal value of compute. No derivation or analysis shows the estimator remains unbiased under heterogeneous base-policy state visitation distributions; states the base policy already handles well are likely underrepresented, so observed reversals could be sampling artifacts rather than intrinsic signal instability.

    Authors: We acknowledge that the manuscript lacks a formal derivation establishing unbiasedness of the counterfactual estimator for arbitrary base-policy visitation distributions. The procedure samples states from trajectories generated by the base policy and directly observes marginal outcomes via paired rollouts with and without extra compute; this yields empirical utility labels conditioned on the encountered distribution. While underrepresentation of well-handled states is possible, it aligns with the deployment distribution the gate must handle. We will revise the method section to explicitly discuss this limitation, add a note on potential sampling effects, and include sensitivity checks (e.g., reweighting or additional uniform sampling) to assess robustness of the learned directions. revision: partial
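One of the sensitivity checks the rebuttal commits to, inverse-propensity reweighting of the biased state sample, could look like the sketch below. The visitation model (a sigmoid of one feature) and the effect size are invented for illustration; the check simply asks whether the estimated utility direction of a feature keeps its sign after reweighting.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: states the base policy handles well are undersampled.
# Reweight each sampled state by 1 / visitation probability and see whether
# the estimated utility direction of a feature changes sign.
n = 5000
feature = rng.normal(size=n)
utility = 0.8 * feature + rng.normal(size=n)   # true direction: positive
visit_p = 1 / (1 + np.exp(-feature))           # low-feature states visited less
keep = rng.random(n) < visit_p                 # biased state sample
w_is = 1 / visit_p[keep]                       # importance weights

def weighted_corr(x, y, w):
    """Weighted Pearson correlation."""
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    return cov / np.sqrt(np.average((x - mx) ** 2, weights=w)
                         * np.average((y - my) ** 2, weights=w))

naive = weighted_corr(feature[keep], utility[keep], np.ones(int(keep.sum())))
reweighted = weighted_corr(feature[keep], utility[keep], w_is)
# If the two estimates agree in sign, the learned direction is robust to
# this particular visitation bias; a sign flip would support the referee.
print(naive, reweighted)
```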

  2. Referee: [Experiments] Experimental results (across six environments and three backbones): the abstract and results claim reversals and superior success-cost trade-offs, yet no quantitative details (e.g., sign-flip magnitudes, correlation values before/after DIAL, or ablation isolating the direction-learning component) are supplied to verify that gains derive from direction learning rather than other factors in the sparse gate.

    Authors: We agree that additional quantitative breakdowns would strengthen verification that performance gains stem specifically from direction learning. The current manuscript emphasizes aggregate success-cost trade-offs across settings, but we will expand the experiments section to report: sign-flip frequencies and magnitudes in signal-utility correlations across (environment, backbone) pairs; pre- and post-DIAL correlation values; and an ablation that holds the sparse gate architecture fixed while comparing learned directions against fixed-direction variants. These additions will isolate the contribution of the direction-learning component. revision: yes

  3. Referee: [Introduction / Two-source model] Two-source model (compute need vs. suitability): the model is invoked to explain why fixed-direction gates fail, but no formalization or testable prediction distinguishes the two sources in a way that would allow falsification of the instability claim independent of the learned gate.

    Authors: The two-source model is presented as a conceptual distinction to interpret the observed reversals: compute need captures whether a state is decision-difficult, while suitability captures whether extra compute can productively improve outcomes given the current context. This framework directly predicts that fixed-direction assumptions will be unreliable across heterogeneous settings, a prediction tested via the empirical reversals and DIAL's relative gains. We will revise the introduction to state the model's core assumptions more explicitly and clarify how the counterfactual-based evaluation provides an independent test of instability (via performance of fixed versus learned gates) without circularity. revision: partial

Circularity Check

0 steps flagged

No significant circularity: direction learning grounded in external counterfactual data

full rationale

The paper defines DIAL as a sparse gate trained directly on outcomes from signal-agnostic counterfactual exploration to learn per-(env, backbone) utility directions. This training procedure uses observed rollout results as labels, making the learned gate an empirical fit to external data rather than a self-referential definition or renamed input. The instability claim rests on reported reversals across environments and backbones, presented as empirical observations rather than derived by construction from the two-source model. No equations, self-citations, or uniqueness theorems are invoked in the abstract or description to force the result; the method remains self-contained against the provided benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so full details on parameters and assumptions cannot be audited. The proposal rests on learning directions from counterfactual data and the two-source model of compute need versus suitability.

axioms (1)
  • domain assumption The two-source model (compute need vs. compute suitability) explains observed signal reversals.
    Invoked to justify why fixed-direction gates are unreliable across heterogeneous settings.
invented entities (1)
  • DIAL sparse gate no independent evidence
    purpose: Learns per-(environment, backbone) utility direction of state features from counterfactual exploration
    New method component introduced to address direction instability; no independent evidence outside the paper is described.

pith-pipeline@v0.9.0 · 5534 in / 1369 out tokens · 65000 ms · 2026-05-11T00:48:10.199502+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages

  1. [1]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023

  2. [2]

    Reflexion: language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In NeurIPS, 2023

  3. [3]

    Language agent tree search unifies reasoning, acting, and planning in language models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In ICML, 2024

  4. [4]

    Mass: Mathematical data selection via skill graphs for pretraining large language models

    Jiazheng Li, Lu Yu, Qing Cui, Zhiqiang Zhang, Jun Zhou, Yanfang Ye, and Chuxu Zhang. Mass: Mathematical data selection via skill graphs for pretraining large language models. In ICML, 2025

  5. [5]

    Graph is a substrate across data modalities

    Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, and Chuxu Zhang. Graph is a substrate across data modalities. CoRR, 2026

  6. [6]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023

  7. [7]

    Reasoning with language model is planning with world model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In EMNLP, 2023

  8. [8]

    Scaling LLM test-time compute optimally can be more effective than scaling model parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. CoRR, 2024

  9. [9]

    Why reasoning fails to plan: A planning-centric analysis of long-horizon decision making in LLM agents

    Zehong Wang, Fang Wu, Hongru Wang, Xiangru Tang, Bolian Li, Zhenfei Yin, Yijun Ma, Yiyang Li, Weixiang Sun, Xiusi Chen, and Yanfang Ye. Why reasoning fails to plan: A planning-centric analysis of long-horizon decision making in LLM agents. CoRR, 2026

  10. [10]

    Scaling test-time compute for LLM agents

    King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, and Wangchunshu Zhou. Scaling test-time compute for LLM agents. CoRR, 2025

  11. [11]

    Semantic exploration with adaptive gating for efficient problem solving with language models

    Sungjae Lee, Hyejin Park, Jaechang Kim, and Jungseul Ok. Semantic exploration with adaptive gating for efficient problem solving with language models. In ACL, 2025

  12. [12]

    Cats: Calibrated test-time scaling for efficient LLM reasoning

    Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, and Jiaxin Huang. Cats: Calibrated test-time scaling for efficient LLM reasoning. In ICLR, 2026

  13. [13]

    Corefine: Confidence-guided self-refinement for adaptive test-time compute

    Chen Jin, Ryutaro Tanno, Tom Diethe, and Philip Teare. Corefine: Confidence-guided self-refinement for adaptive test-time compute. CoRR, 2026

  14. [14]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. In EMNLP, 2025

  15. [15]

    Token-budget-aware LLM reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In ACL, 2025

  16. [16]

    Agentic test-time scaling for webagents

    Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. Agentic test-time scaling for webagents. CoRR, 2026

  17. [17]

    Diffadapt: Difficulty-adaptive reasoning for token-efficient LLM inference

    Xiang Liu, Xuming Hu, Xiaowen Chu, and Eunsol Choi. Diffadapt: Difficulty-adaptive reasoning for token-efficient LLM inference. CoRR, 2025

  18. [18]

    The LLM already knows: Estimating LLM-perceived question difficulty via hidden representations

    Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong, and Jing Shao. The LLM already knows: Estimating LLM-perceived question difficulty via hidden representations. In EMNLP, 2025

  19. [19]

    Adaptthink: Reasoning models can learn when to think

    Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. Adaptthink: Reasoning models can learn when to think. In EMNLP, 2025

  20. [20]

    Thinkless: LLM learns when to think

    Gongfan Fang, Xinyin Ma, and Xinchao Wang. Thinkless: LLM learns when to think. CoRR, 2025

  21. [21]

    The interpretation of interaction in contingency tables

    Edward H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, 1951

  22. [22]

    Learning to discover various Simpson's paradoxes

    Jingwei Wang, Jianshan He, Weidi Xu, Ruopeng Li, and Wei Chu. Learning to discover various Simpson's paradoxes. In KDD, 2023

  23. [23]

    Think just enough: Sequence-level entropy as a confidence signal for LLM reasoning

    Aman Sharma and Paras Chopra. Think just enough: Sequence-level entropy as a confidence signal for LLM reasoning. CoRR, 2025

  24. [24]

    Agentic uncertainty quantification

    Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang, Caiming Xiong, and Chien-Sheng Wu. Agentic uncertainty quantification. CoRR, 2026

  25. [25]

    L1: Controlling how long a reasoning model thinks with reinforcement learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. CoRR, 2025

  26. [26]

    Learning when to plan: Efficiently allocating test-time compute for LLM agents

    Davide Paglieri, Bartlomiej Cupial, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. Learning when to plan: Efficiently allocating test-time compute for LLM agents. CoRR, 2025

  27. [27]

    Training verifiers to solve math word problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, 2021

  28. [28]

    GPQA: A graduate-level Google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, 2023

  29. [29]

    Revisiting uncertainty estimation and calibration of large language models

    Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, and Chang Xu. Revisiting uncertainty estimation and calibration of large language models. CoRR, 2025

  30. [30]

    Do LLMs estimate uncertainty well in instruction-following?

    Juyeon Heo, Miao Xiong, Christina Heinze-Deml, and Jaya Narain. Do LLMs estimate uncertainty well in instruction-following? In ICLR, 2025

  31. [31]

    To believe or not to believe your LLM: iterative prompting for estimating epistemic uncertainty

    Yasin Abbasi-Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your LLM: iterative prompting for estimating epistemic uncertainty. In NeurIPS, 2024

  32. [32]

    Do the right thing: studies in limited rationality

    Stuart J. Russell and Eric Wefald. Do the right thing: studies in limited rationality. MIT Press, 1991

  33. [33]

    Rational metareasoning for large language models

    Nicolò De Sabbata, Theodore R. Sumers, and Thomas L. Griffiths. Rational metareasoning for large language models. CoRR, 2024

  34. [34]

    Qwen3 technical report

    Qwen Team. Qwen3 technical report. CoRR, 2025

  35. [35]

    The Llama 3 herd of models

    Llama Team. The Llama 3 herd of models. CoRR, 2024

  36. [36]

    Phi-3 technical report: A highly capable language model locally on your phone

    Microsoft. Phi-3 technical report: A highly capable language model locally on your phone. CoRR, 2024

  37. [37]

    Comment: Understanding Simpson's paradox

    Judea Pearl. Comment: Understanding Simpson's paradox. In Probabilistic and Causal Inference: The Works of Judea Pearl. 2022

  38. [38]

    Aleatory or epistemic? Does it matter?

    Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 2009

  39. [39]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018

  40. [40]

    Measuring coding challenge competence with APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Andy Zou, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In NeurIPS, 2021

  41. [41]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In NeurIPS, 2022

  42. [42]

    FEVER: a large-scale dataset for fact extraction and verification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and verification. In NAACL, 2018

  43. [43]

    TextWorldExpress: Simulating text games at one million steps per second

    Peter Jansen and Marc-Alexandre Côté. TextWorldExpress: Simulating text games at one million steps per second. In EACL, 2023

  44. [44]

    Plancraft: an evaluation dataset for planning with LLM agents

    Gautier Dagan, Frank Keller, and Alex Lascarides. Plancraft: an evaluation dataset for planning with LLM agents. CoRR, 2024

  45. [45]

    Agentic reinforced policy optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic reinforced policy optimization. CoRR, 2025

  46. [46]

    Adaptive computation time for recurrent neural networks

    Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, 2016

  47. [47]

    Branchynet: Fast inference via early exiting from deep neural networks

    Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. Branchynet: Fast inference via early exiting from deep neural networks. CoRR, 2017

  48. [48]

    Layerskip: Enabling early exit inference and self-speculative decoding

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, and Carole-Jean Wu. Layerskip: Enabling early exit inference and self-speculative decoding. In ACL, 2024

  49. [49]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017

  50. [50]

    Mixture-of-depths: Dynamically allocating compute in transformer-based language models

    David Raposo, Samuel Ritter, Blake A. Richards, Timothy P. Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. CoRR, 2024

  51. [51]

    Language models (mostly) know what they know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, et al. Language models (mostly) know what they know. CoRR, 2022

  52. [52]

    Reasoning about beliefs and actions under computational resource constraints

    Eric Horvitz. Reasoning about beliefs and actions under computational resource constraints. In Laveen N. Kanal, Tod S. Levitt, and John F. Lemmer, editors, UAI, 1987

  53. [53]

    Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources

    Falk Lieder and Thomas L. Griffiths. Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, 2020

  54. [54]

    Alphazero-like tree-search can guide large language model decoding and training

    Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. In ICML, 2024

  55. [55]

    Wider or deeper? Scaling LLM inference-time compute with adaptive branching tree search

    Kou Misaki, Yuichi Inoue, Yuki Imajuku, So Kuroki, Taishi Nakamura, and Takuya Akiba. Wider or deeper? Scaling LLM inference-time compute with adaptive branching tree search. CoRR, 2025

  56. [56]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In ICLR, 2023

  57. [57]

    Large language monkeys: Scaling inference compute with repeated sampling

    Bradley C. A. Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. CoRR, 2024

  58. [58]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In ICLR, 2024

  59. [59]

    Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In ACL, 2024

  60. [60]

    Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. CoRR, 2025

  61. [61]

    Tree search for language model agents

    Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language model agents. CoRR, 2024

  62. [62]

    Agent Q: advanced reasoning and learning for autonomous AI agents

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent Q: advanced reasoning and learning for autonomous AI agents. CoRR, 2024

  63. [63]

    Evolverouter: Co-evolving routing and prompt for multi-agent question answering

    Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, and Chuxu Zhang. Evolverouter: Co-evolving routing and prompt for multi-agent question answering. CoRR, 2026

  64. [64]

    Agentrouter: A knowledge-graph-guided LLM router for collaborative multi-agent question answering

    Zheyuan Zhang, Kaiwen Shi, Zhengqing Yuan, Zehong Wang, Tianyi Ma, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, and Yanfang Ye. Agentrouter: A knowledge-graph-guided LLM router for collaborative multi-agent question answering. CoRR, 2025

  65. [65]

    Autodata: A multi-agent system for open web data collection

    Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, et al. Autodata: A multi-agent system for open web data collection. CoRR, 2025