Recognition: 2 theorem links · Lean Theorem
Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents
Pith reviewed 2026-05-11 00:48 UTC · model grok-4.3
The pith
Gating signals for extra LLM agent compute reverse their meaning across environments and models, so direction must be learned per setting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that alignment between gating signals and rollout utility is unstable: the identical signal predicts performance gain in some (environment, backbone) combinations and performance loss in others. This instability stems from the distinction between compute need and compute suitability under a two-source model. Fixed-direction gates therefore risk selecting precisely the states where extra computation reduces final outcome quality. DIAL trains a sparse gate on labels from signal-agnostic counterfactual exploration to recover the utility direction of state features for each specific (environment, backbone) pair, delivering stronger success-cost trade-offs across six environments.
What carries the argument
DIAL, a sparse gate trained via counterfactual exploration to recover the per-setting utility direction of state features.
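Per the paper passage quoted in the theorem-link section below, DIAL fits an ℓ1-regularized logistic gate whose signed weights jointly select features and set the per-setting direction. A minimal sketch of such a gate, assuming scikit-learn and illustrative feature and label conventions rather than the paper's exact pipeline:

```python
# Minimal sketch of a DIAL-style gate, assuming scikit-learn; the feature set and
# label construction are illustrative, not the paper's exact pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_direction_gate(state_features: np.ndarray, helped: np.ndarray):
    """state_features: (n_states, n_signals), e.g. entropy, confidence, step index.
    helped: 1 if the paired rollout with extra compute beat the base policy at
    that state, else 0 (labels from signal-agnostic counterfactual exploration)."""
    gate = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    gate.fit(state_features, helped)
    # Sparse, signed weights: a zero weight drops a signal, and a negative weight
    # on entropy means "high uncertainty -> skip extra compute" in this setting.
    return gate.coef_.ravel(), float(gate.intercept_[0])

# One gate would be fit per (environment, backbone) pair:
# weights, bias = fit_direction_gate(X_pair, y_pair)
```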
If this is right
- Wrong-direction gates degrade performance by routing extra compute exactly to states where it reduces outcome quality.
- Compute need and compute suitability are distinct sources, so any fixed-direction rule becomes unreliable when environments or backbones change.
- Learning the direction from data per (environment, backbone) produces a stronger overall success-cost trade-off than fixed baselines.
- The two-source model explains why uncertainty signals can indicate either helpful decision difficulty or unhelpful intervention contexts.
Where Pith is reading between the lines
- Adaptive compute methods may require per-deployment calibration rather than one-size-fits-all rules if reversals prove common.
- The counterfactual labeling approach could extend to other adaptive decisions such as retrieval or tool selection in agents.
- Online variants of DIAL might update directions during operation without a separate exploration phase.
- Hybrid gates that combine learned direction with other signals could improve robustness when exploration data is limited.
Load-bearing premise
Counterfactual exploration produces reliable labels for the true utility direction of state features without selection bias.
What would settle it
Measure actual rollout benefit or harm on held-out states for a new environment-backbone pair; if DIAL's learned directions disagree with these measured outcomes on more than a small fraction of states, the learned gate does not recover the correct utility direction.
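As a concrete rendering of that test, under assumed data shapes (the gate-logit and per-state benefit arrays are hypothetical names, not the paper's):

```python
# Falsification-check sketch: compare the gate's invoke/skip decision with the
# measured benefit of the extra rollout on held-out states of a new
# (environment, backbone) pair. Names and tolerance are assumptions.
import numpy as np

def direction_disagreement(gate_logits: np.ndarray, measured_benefit: np.ndarray) -> float:
    """gate_logits: gate scores on held-out states (positive => invoke extra compute).
    measured_benefit: outcome(with rollout) - outcome(without rollout), per state."""
    gate_says_invoke = gate_logits > 0.0
    rollout_helped = measured_benefit > 0.0
    return float(np.mean(gate_says_invoke != rollout_helped))

# A disagreement fraction well above a small tolerance (say 0.1) on held-out
# states would mean the learned gate does not recover the utility direction.
```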
Original abstract
Adaptive test-time compute for LLM agents aims to invoke extra computation only when it improves performance. Existing methods typically use confidence-, uncertainty-, or difficulty-based gates, assuming a fixed direction from the gating signal through compute need to the value of computation. This makes gating a utility-calibration problem: gating signals should align with whether extra computation improves the final outcome over the base policy. We show that this alignment is unstable: the same signal predicts rollout benefit in one setting and rollout harm in another, with reversals across environments and backbones even when the task is fixed. Wrong-direction gates can therefore worsen performance by precisely selecting harmful states. This reversal reflects a deeper distinction between compute need and compute suitability: a high uncertainty signal may indicate decision-difficult states where rollouts help compare alternatives, or intervention-unsuitable states where the current context does not support useful rollout-based improvement. Under this two-source model, fixed-direction gates are unreliable across heterogeneous settings. To address this, we propose DIAL (Direction-Informed Adaptive Learning), a sparse gate trained from signal-agnostic counterfactual exploration to learn the utility direction of state features per (environment, backbone). Across six environments and three backbones, DIAL yields a stronger overall success-cost trade-off than fixed-direction baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard gating signals (e.g., uncertainty or confidence) for adaptive test-time compute in LLM agents exhibit unstable alignment with the utility of extra computation: the same signal can predict rollout benefit in one setting and harm in another, with reversals across environments and backbones even for fixed tasks. This is explained via a two-source model distinguishing compute need from compute suitability. The authors propose DIAL, a sparse gate trained via signal-agnostic counterfactual exploration to learn per-(environment, backbone) utility directions from observed outcomes, and report that it achieves a stronger overall success-cost trade-off than fixed-direction baselines across six environments and three backbones.
Significance. If the empirical results hold and the counterfactual labels are unbiased, the work is significant for LLM agent and test-time scaling research: it identifies a fundamental instability in existing adaptive-compute assumptions and supplies a data-driven mechanism to learn directions rather than assuming fixed alignment. The grounding of direction learning in counterfactual rollouts is a methodological strength that avoids pure self-reference.
major comments (3)
- [Method (DIAL)] Method section (DIAL training procedure): the central claim that DIAL learns the 'true' utility direction per (env, backbone) from counterfactual exploration requires that the exploration procedure generates unbiased labels for marginal value of compute. No derivation or analysis shows the estimator remains unbiased under heterogeneous base-policy state visitation distributions; states the base policy already handles well are likely underrepresented, so observed reversals could be sampling artifacts rather than intrinsic signal instability.
- [Experiments] Experimental results (across six environments and three backbones): the abstract and results claim reversals and superior success-cost trade-offs, yet no quantitative details (e.g., sign-flip magnitudes, correlation values before/after DIAL, or ablation isolating the direction-learning component) are supplied to verify that gains derive from direction learning rather than other factors in the sparse gate.
- [Introduction / Two-source model] Two-source model (compute need vs. suitability): the model is invoked to explain why fixed-direction gates fail, but no formalization or testable prediction distinguishes the two sources in a way that would allow falsification of the instability claim independent of the learned gate.
minor comments (2)
- [Method] Notation for the sparse gate and counterfactual labels should be defined more explicitly with symbols to avoid ambiguity when comparing to fixed-direction baselines.
- [Abstract] The abstract would benefit from one concrete numerical example of a reversal (e.g., correlation sign change) to ground the instability claim before the method is introduced.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the potential significance of identifying instability in standard gating signals for adaptive test-time compute. We address each major comment below with clarifications and commitments to revision where appropriate. Our responses focus on substance and aim to strengthen the manuscript without overstating current content.
Point-by-point responses
-
Referee: [Method (DIAL)] Method section (DIAL training procedure): the central claim that DIAL learns the 'true' utility direction per (env, backbone) from counterfactual exploration requires that the exploration procedure generates unbiased labels for marginal value of compute. No derivation or analysis shows the estimator remains unbiased under heterogeneous base-policy state visitation distributions; states the base policy already handles well are likely underrepresented, so observed reversals could be sampling artifacts rather than intrinsic signal instability.
Authors: We acknowledge that the manuscript lacks a formal derivation establishing unbiasedness of the counterfactual estimator for arbitrary base-policy visitation distributions. The procedure samples states from trajectories generated by the base policy and directly observes marginal outcomes via paired rollouts with and without extra compute; this yields empirical utility labels conditioned on the encountered distribution. While underrepresentation of well-handled states is possible, the sampled distribution matches the deployment distribution the gate must handle. We will revise the method section to explicitly discuss this limitation, add a note on potential sampling effects, and include sensitivity checks (e.g., reweighting or additional uniform sampling) to assess robustness of the learned directions. revision: partial
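One way the promised sensitivity check could look, as a sketch: refit the gate with weights that push the counterfactual labels toward uniform coverage of visited-state buckets and see whether the learned signs move. The bucketing and inverse-frequency weighting are assumptions for illustration, not the paper's procedure.

```python
# Sensitivity-check sketch: refit the gate with inverse-visitation weights and
# compare weight signs against the unweighted fit (assumed setup, not the paper's).
import numpy as np
from sklearn.linear_model import LogisticRegression

def refit_with_reweighting(features, labels, visit_counts):
    """visit_counts: how often the base policy visited each state's bucket;
    inverse-frequency weights approximate uniform coverage of the state space."""
    w = 1.0 / np.maximum(visit_counts, 1)
    w = w * (len(w) / w.sum())  # normalize so weights average to 1
    gate = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    gate.fit(features, labels, sample_weight=w)
    return np.sign(gate.coef_.ravel())

# If the sign pattern matches the unweighted fit, the learned directions are not
# an artifact of the base policy's state-visitation distribution.
```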
-
Referee: [Experiments] Experimental results (across six environments and three backbones): the abstract and results claim reversals and superior success-cost trade-offs, yet no quantitative details (e.g., sign-flip magnitudes, correlation values before/after DIAL, or ablation isolating the direction-learning component) are supplied to verify that gains derive from direction learning rather than other factors in the sparse gate.
Authors: We agree that additional quantitative breakdowns would strengthen verification that performance gains stem specifically from direction learning. The current manuscript emphasizes aggregate success-cost trade-offs across settings, but we will expand the experiments section to report: sign-flip frequencies and magnitudes in signal-utility correlations across (environment, backbone) pairs; pre- and post-DIAL correlation values; and an ablation that holds the sparse gate architecture fixed while comparing learned directions against fixed-direction variants. These additions will isolate the contribution of the direction-learning component. revision: yes
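A sketch of the kind of breakdown this would involve, assuming each (environment, backbone) pair comes with a raw signal series and measured per-state rollout benefit (the data layout is an assumption):

```python
# Sign-flip / correlation breakdown sketch (assumed data layout).
import numpy as np

def signal_utility_breakdown(pairs):
    """pairs: dict mapping (env, backbone) -> (signal_values, measured_benefit)."""
    corrs = {key: float(np.corrcoef(sig, ben)[0, 1]) for key, (sig, ben) in pairs.items()}
    signs = np.sign(np.array(list(corrs.values())))
    flip_fraction = float(np.mean(signs != signs[0]))  # share of pairs whose sign differs from the first
    return corrs, flip_fraction

# A nonzero flip_fraction is exactly the reversal the paper reports: the same
# signal correlates positively with rollout benefit in one setting and negatively in another.
```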
-
Referee: [Introduction / Two-source model] Two-source model (compute need vs. suitability): the model is invoked to explain why fixed-direction gates fail, but no formalization or testable prediction distinguishes the two sources in a way that would allow falsification of the instability claim independent of the learned gate.
Authors: The two-source model is presented as a conceptual distinction to interpret the observed reversals: compute need captures whether a state is decision-difficult, while suitability captures whether extra compute can productively improve outcomes given the current context. This framework directly predicts that fixed-direction assumptions will be unreliable across heterogeneous settings, a prediction tested via the empirical reversals and DIAL's relative gains. We will revise the introduction to state the model's core assumptions more explicitly and clarify how the counterfactual-based evaluation provides an independent test of instability (via performance of fixed versus learned gates) without circularity. revision: partial
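For readers wanting the model spelled out: the regime equations quoted later in the theorem-link section can be written, with an expository mixture weight π(s) that is our notation rather than the paper's, as

```latex
% Two-source model sketch: the regime slopes and signs follow the paper's quoted
% passage; the mixture weight \pi(s) and zero-mean noise are added for exposition.
\begin{aligned}
\text{Type D (decision-difficult):} \quad & U_D(s) \approx +\beta\, H(s) + \varepsilon_D, \\
\text{Type I (intervention-unsuitable):} \quad & U_I(s) \approx -\alpha\, H(s) + \varepsilon_I, \\
\mathbb{E}\!\left[U(s) \mid H(s)\right] &= \bigl(\pi(s)\,\beta - (1-\pi(s))\,\alpha\bigr)\, H(s),
\end{aligned}
```

so the correct gate direction follows the sign of the bracketed slope, which can flip as the regime mixture shifts across environments and backbones.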
Circularity Check
No significant circularity: direction learning grounded in external counterfactual data
Full rationale
The paper defines DIAL as a sparse gate trained directly on outcomes from signal-agnostic counterfactual exploration to learn per-(env, backbone) utility directions. This training procedure uses observed rollout results as labels, making the learned gate an empirical fit to external data rather than a self-referential definition or renamed input. The instability claim rests on reported reversals across environments and backbones, presented as empirical observations rather than derived by construction from the two-source model. No equations, self-citations, or uniqueness theorems are invoked in the abstract or description to force the result; the method remains self-contained against the provided benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The two-source model (compute need vs. compute suitability) explains observed signal reversals.
invented entities (1)
- DIAL sparse gate (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We model direction reversal as arising from two coexisting regimes... Type I (intervention-unsuitable): U_I(s) ~ -α H(s) + ε_I ... Type D (decision-difficult): U_D(s) ~ +β H(s) + ε_D"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "DIAL ... fits an ℓ1-regularized logistic gate ... signed weights jointly select features and recover per-environment direction"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.