pith. sign in

arxiv: 2606.29745 · v1 · pith:F23SBWARnew · submitted 2026-06-29 · 💻 cs.MA

ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit

Pith reviewed 2026-06-30 04:32 UTC · model grok-4.3

classification 💻 cs.MA
keywords epistemic decision processeslanguage agentsturn-level credit assignmentposterior-sensitive rewardsinformation seekingbelief statespolicy gradientsmulti-turn adaptivity
0
0 comments X

The pith

Belief-agnostic policies suffer exponentially compounding errors over multi-turn horizons, while ECHO assigns turn-level credit via posterior-sensitive rewards to enable epistemic adaptivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Epistemic Decision Processes as belief-state models in which language agents act to update posteriors over a latent task variable through observations. It proves that policies blind to the current belief produce errors that compound exponentially with horizon length, and that whole-trajectory returns cannot isolate the per-turn Bayesian advantage required for credit assignment. ECHO is introduced as a clipped policy-gradient objective that incorporates posterior-sensitive rewards to supply turn-level signals. In the Clue Selector Game benchmark this yields higher resolution, information gain, and efficiency than trajectory-level methods while matching frontier performance on grounding, recovery, and calibration.

Core claim

In Epistemic Decision Processes good policies select actions that are useful under the current posterior over a latent task variable rather than merely those correlated with eventual success. Belief-agnostic policies suffer errors that compound exponentially over the horizon, and aggregate trajectory returns fail to identify the per-turn Bayesian advantage needed for epistemic credit. ECHO supplies a practical clipped policy-gradient objective that assigns turn-level credit using posterior-sensitive rewards.

What carries the argument

Epistemic Decision Processes (EDPs), a belief-state formulation of information seeking in which actions update the agent's posterior over a latent task variable, together with the ECHO clipped policy-gradient objective that uses posterior-sensitive rewards for turn-level credit assignment.

If this is right

  • ECHO improves resolution, information gain, and efficiency over trajectory-level GRPO in the Clue Selector Game.
  • ECHO matches or exceeds frontier baselines on epistemic metrics including grounding, recovery, and calibration while producing almost no visible reasoning text.
  • Turn-level posterior rewards address the inability of aggregate returns to isolate per-turn Bayesian advantages.
  • The approach makes epistemic adaptivity explicit by tying action selection to current belief rather than eventual success alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The EDP formulation could extend to other multi-turn settings where agents must decide when to gather evidence versus act, such as diagnostic or planning dialogues.
  • If exact posterior tracking proves intractable at scale, lightweight approximations might still preserve the turn-level credit signal.
  • The exponential compounding result suggests testing whether similar error growth appears in non-language POMDP tasks with evolving beliefs.

Load-bearing premise

A well-defined latent task variable exists whose posterior can be maintained and updated from observations to produce usable per-turn rewards for the policy gradient.

What would settle it

An experiment in which belief-agnostic policies exhibit no exponential error compounding across turns, or in which posterior maintenance fails to generate effective turn-level reward signals.

Figures

Figures reproduced from arXiv: 2606.29745 by Abhijnan Nath, Nikhil Krishnaswamy.

Figure 1
Figure 1. Figure 1: Results on multi-turn information seeking in the [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: EDP intuition. Two histories can share the same belief-independent task context s(ht) = s(h ′ t) while inducing different beliefs bt ̸= b ′ t over the latent task variable. An epistemi￾cally adaptive policy conditions on the belief state at each turn t to choose different useful actions, so a ⋆ (bt) ̸= a ⋆ (b ′ t). A belief-agnostic policy that depends only on s(ht) and t, but not bt, cannot make this dist… view at source ↗
Figure 3
Figure 3. Figure 3: Clue Selector Game overview and example trajectory from real evaluation data: [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Behavioral diagnostics in CSG. ECHO succeeds by reducing zero-elimination turns, maintaining high-quality belief contraction, shrinking the candidate set faster, and recovering more robustly after epistemic failures. Results on question type are shown in [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance and behavioral trends during [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompts used in the Clue Selector Game (CSG). (A) The selector prompt shown to the information￾seeking agent at each turn t. The current candidate set Ct, remaining turn budget, and arm-specific query histories encode the agent’s observable belief state. The agent must return JSON specifying an arm index and a property question, without a chain-of-thought scaffold. (B) The oracle agent prompt used by a fix… view at source ↗
Figure 7
Figure 7. Figure 7: Additional learning dynamics of RL baselines and diagnostic metrics across training steps [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: CSG dataset structure. The dataset is organized from model-level summaries to run-level [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗
read the original abstract

What does it mean for a language agent to be adaptive? Effective multi-turn agents must decide what information to seek, how to use new evidence, and when they are certain enough to act. We introduce Epistemic Decision Processes (EDPs), a belief-state formulation of multi-turn information seeking in which actions produce external observations that update the agent's posterior over a latent task variable. EDPs make epistemic adaptivity explicit: good policies choose actions that are useful under the current belief, not merely those that correlate with eventual success. We prove that belief-agnostic policies can suffer errors that compound exponentially over the horizon, and that aggregate trajectory returns can fail to identify the per-turn Bayesian advantage needed for epistemic credit. We then introduce ECHO (Epistemic Credit for History-Conditioned Optimization), a practical clipped policy-gradient objective that assigns turn-level credit using posterior-sensitive rewards. In the Clue Selector Game, a novel controlled evidence-seeking benchmark, we show that ECHO substantially improves resolution, information gain, and efficiency over trajectory-level GRPO, and matches or exceeds frontier baselines on epistemic metrics such as grounding, recovery, and calibration while producing almost no visible reasoning text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Epistemic Decision Processes (EDPs) as a belief-state formulation of multi-turn information-seeking tasks where actions yield observations that update a posterior over a latent task variable θ. It proves that belief-agnostic policies incur exponentially compounding errors over the horizon and that aggregate trajectory returns fail to isolate per-turn Bayesian advantages. It then defines the ECHO clipped policy-gradient objective that incorporates posterior-sensitive rewards r_t(θ) for turn-level credit assignment. Empirical evaluation on the Clue Selector Game benchmark shows ECHO improving resolution, information gain, efficiency, grounding, recovery, and calibration relative to trajectory-level GRPO while producing minimal visible reasoning text.

Significance. If the central claims hold, the work supplies a formal framework and practical objective for training language agents that explicitly optimize for epistemic adaptivity rather than end-of-trajectory success. The theoretical separation between belief-agnostic and belief-aware policies, together with the controlled Clue Selector Game benchmark, would be a useful contribution to the literature on multi-turn RL for LLMs. The paper receives credit for defining EDPs, stating the exponential-compounding result, and releasing a reproducible benchmark comparison.

major comments (3)
  1. [Abstract / EDP formulation] Abstract and EDP definition: the central claims (exponential error compounding and the necessity of posterior-sensitive rewards) rest on the existence of a well-defined latent θ whose posterior p(θ|h) can be maintained and updated from language-model observations; no identifiability conditions, approximation-error bounds, or robustness analysis for cases in which observations do not cleanly separate θ are supplied, which directly affects whether the claimed separation from GRPO is realizable.
  2. [Theoretical results on belief-agnostic policies] Proof of exponential error compounding: the derivation presupposes observable updates to p(θ|h) that yield usable per-turn Bayesian advantages; without explicit conditions on the observation model or bounds on posterior estimation error, it is unclear whether the exponential-compounding result transfers to the language-agent setting where the posterior model may differ from the policy.
  3. [ECHO objective and Clue Selector Game experiments] ECHO objective and experimental section: the clipped policy-gradient uses posterior-sensitive rewards r_t(θ); if the posterior estimator introduces bias or noise (as is likely when the model used for p(θ|h) differs from the policy), the turn-level credit signal may not reliably outperform trajectory-level baselines, yet no ablation on posterior accuracy or statistical significance tests with error bars are reported.
minor comments (2)
  1. [Method] Notation for the posterior-sensitive reward r_t(θ) and the clipping parameter in the ECHO objective should be defined explicitly with reference to the standard PPO/GRPO formulation to aid readability.
  2. [Experiments] The claim of 'almost no visible reasoning text' is interesting but would benefit from a quantitative metric or example traces in the results section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the EDP formulation, theoretical results, and experimental validation. We address each major comment below, clarifying the assumptions in our framework and noting planned revisions to enhance robustness.

read point-by-point responses
  1. Referee: [Abstract / EDP formulation] Abstract and EDP definition: the central claims (exponential error compounding and the necessity of posterior-sensitive rewards) rest on the existence of a well-defined latent θ whose posterior p(θ|h) can be maintained and updated from language-model observations; no identifiability conditions, approximation-error bounds, or robustness analysis for cases in which observations do not cleanly separate θ are supplied, which directly affects whether the claimed separation from GRPO is realizable.

    Authors: The EDP formulation explicitly defines a latent task variable θ with observations that update the posterior p(θ|h). The theoretical claims are derived under this model assuming the observation process allows for posterior updates. In the Clue Selector Game benchmark, θ is the hidden clue, which is fully identifiable, allowing exact posterior maintenance. We do not provide general identifiability conditions for arbitrary language observations, as the work focuses on the controlled setting where the separation is demonstrated. We will revise the manuscript to include a discussion of the identifiability assumption and potential approximation errors in more general settings. revision: partial

  2. Referee: [Theoretical results on belief-agnostic policies] Proof of exponential error compounding: the derivation presupposes observable updates to p(θ|h) that yield usable per-turn Bayesian advantages; without explicit conditions on the observation model or bounds on posterior estimation error, it is unclear whether the exponential-compounding result transfers to the language-agent setting where the posterior model may differ from the policy.

    Authors: The proof is conducted within the EDP framework, which by definition includes observable updates to the belief state p(θ|h). The exponential compounding result is a mathematical property of belief-agnostic policies in this setting and does not rely on a specific posterior estimator. In the language-agent experiments, the benchmark ensures that the posterior is computed consistently with the observations. We will add explicit statements on the observation model assumptions to the theoretical section in the revision. revision: partial

  3. Referee: [ECHO objective and Clue Selector Game experiments] ECHO objective and experimental section: the clipped policy-gradient uses posterior-sensitive rewards r_t(θ); if the posterior estimator introduces bias or noise (as is likely when the model used for p(θ|h) differs from the policy), the turn-level credit signal may not reliably outperform trajectory-level baselines, yet no ablation on posterior accuracy or statistical significance tests with error bars are reported.

    Authors: In the Clue Selector Game, the posterior p(θ|h) is computed exactly from the game state and observations without any estimation bias or noise, as the latent θ is known to the environment. Thus, the posterior-sensitive rewards are precise, and the policy and posterior are not from differing models. We agree that reporting statistical significance and error bars would strengthen the results. We will include these in the revised experimental section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained within introduced EDP framework

full rationale

The paper defines EDPs as a new belief-state formulation, proves properties of belief-agnostic policies under that model, and derives the ECHO objective from posterior-sensitive rewards within the same framework. These steps rely on the modeling choice of a latent task variable and its posterior updates, which are external to the derivation rather than self-referential. No equations reduce predictions or advantages to fitted parameters defined from the same data, and no load-bearing self-citations or ansatzes are invoked. The central claims remain independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract-only; the central claim rests on the existence of a latent task variable with computable posterior updates and on the validity of the exponential compounding proof, none of which can be audited without the full text.

axioms (1)
  • domain assumption Actions produce external observations that update the agent's posterior over a latent task variable via standard Bayesian updating.
    Stated in the definition of EDPs in the abstract.
invented entities (2)
  • Epistemic Decision Processes (EDPs) no independent evidence
    purpose: Belief-state formulation of multi-turn information seeking
    New modeling framework introduced in the abstract; no independent evidence provided beyond the paper's own definition.
  • ECHO objective no independent evidence
    purpose: Clipped policy-gradient with posterior-sensitive rewards for turn-level credit
    New optimization method introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5734 in / 1481 out tokens · 43869 ms · 2026-06-30T04:32:15.827369+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

92 extracted references · 34 canonical work pages · 12 internal anchors

  1. [1]

    LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

    Marwa Abdulhai, Isadora White, Charlie Victor Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, and Sergey Levine. LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models. InForty-second International Conference on Machine Learning, 2023

  2. [2]

    Back to Basics: Revisiting REINFORCE -Style Optimization for Learning from Human Feedback in LLM s

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational...

  3. [3]

    STaR- GATE: Teaching Language Models to Ask Clarifying Questions

    Chinmaya Andukuri, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah Goodman. STaR- GATE: Teaching Language Models to Ask Clarifying Questions. InFirst Conference on Language Modeling, 2024

  4. [4]

    Optimal control of markov processes with incomplete state information i

    Karl Johan Åström. Optimal control of markov processes with incomplete state information i. Journal of mathematical analysis and applications, 10:174–205, 1965. 10

  5. [5]

    Align while search: Belief- guided exploratory inference for world-grounded embodied agents, 2025

    Seohui Bae, Jeonghye Kim, Youngchul Sung, and Woohyung Lim. Align while search: Belief- guided exploratory inference for world-grounded embodied agents, 2025. URL https:// arxiv.org/abs/2512.24461

  6. [6]

    Recent advances in hierarchical reinforcement learning.Discrete event dynamic systems, 13(1-2):41–77, 2003

    Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning.Discrete event dynamic systems, 13(1-2):41–77, 2003

  7. [7]

    A problem in the sequential design of experiments.Sankhy ¯a: The Indian Journal of Statistics (1933-1960), 16(3/4):221–229, 1956

    Richard Bellman. A problem in the sequential design of experiments.Sankhy ¯a: The Indian Journal of Statistics (1933-1960), 16(3/4):221–229, 1956

  8. [8]

    Chatgpt’s information seeking strategy: Insights from the 20-questions game

    Leonardo Bertolazzi, Davide Mazzaccara, Filippo Merlo, and Raffaella Bernardi. Chatgpt’s information seeking strategy: Insights from the 20-questions game. InProceedings of the 16th International Natural Language Generation Conference, pages 153–162, 2023

  9. [9]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar- wal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- teusz Litwin, S...

  10. [10]

    BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

    Deepro Choudhury, Sinead Williamson, Adam Goli´nski, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, and Tom Rainforth. Bed-llm: Intelligent information gathering with llms and bayesian experimental design.arXiv preprint arXiv:2508.21184, 2025

  11. [11]

    Process reward models for llm agents: Practical framework and directions,

    Sanjiban Choudhury. Process reward models for llm agents: Practical framework and directions,

  12. [12]

    URLhttps://arxiv.org/abs/2502.10325

  13. [13]

    Children’s questions: A mecha- nism for cognitive development.Monographs of the society for research in child development, pages i–129, 2007

    Michelle M Chouinard, Paul L Harris, and Michael P Maratsos. Children’s questions: A mecha- nism for cognitive development.Monographs of the society for research in child development, pages i–129, 2007

  14. [14]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  15. [15]

    RLHF Workflow: From Reward Modeling to Online RLHF

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024

  16. [16]

    University of Massachusetts Amherst, 2002

    Michael O’Gordon Duff.Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. University of Massachusetts Amherst, 2002

  17. [17]

    BALAR : A Bayesian Agentic Loop for Active Reasoning

    Aymen Echarghaoui, Dongxia Wu, and Emily B Fox. Balar: A bayesian agentic loop for active reasoning.arXiv preprint arXiv:2605.05386, 2026

  18. [18]

    Is in-context learning in large language models bayesian? a martingale perspective

    Fabian Falck, Ziyu Wang, and Christopher C Holmes. Is in-context learning in large language models bayesian? a martingale perspective. InInternational Conference on Machine Learning, pages 12784–12805. PMLR, 2024

  19. [19]

    From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678, 2025

  20. [20]

    Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies, 2021

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies, 2021. URLhttps://arxiv.org/abs/2101.02235

  21. [21]

    Bayesian rein- forcement learning: A survey.Foundations and Trends® in Machine Learning, 8(5-6): 359–483, November 2015

    Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Bayesian rein- forcement learning: A survey.Foundations and Trends® in Machine Learning, 8(5-6): 359–483, November 2015. ISSN 1935-8245. doi: 10.1561/2200000049. URL http: //dx.doi.org/10.1561/2200000049. 11

  22. [22]

    Variance reduction techniques for gradient estimates in reinforcement learning.Journal of Machine Learning Research, 5(Nov): 1471–1530, 2004

    Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning.Journal of Machine Learning Research, 5(Nov): 1471–1530, 2004

  23. [23]

    Segment policy optimization: Ef- fective segment-level credit assignment in rl for large language models.arXiv preprint arXiv:2505.23564, 2025

    Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, and Shuang Qiu. Segment policy optimization: Ef- fective segment-level credit assignment in rl for large language models.arXiv preprint arXiv:2505.23564, 2025

  24. [24]

    Information value theory.IEEE Transactions on systems science and cybernetics, 2(1):22–26, 1966

    Ronald A Howard. Information value theory.IEEE Transactions on systems science and cybernetics, 2(1):22–26, 1966

  25. [25]

    Playing 20 question game with policy-based reinforcement learning, 2026

    Huang Hu, Xianchao Wu, Bingfeng Luo, Chongyang Tao, Can Xu, Wei Wu, and Zhan Chen. Playing 20 question game with policy-based reinforcement learning, 2026. URL https: //arxiv.org/abs/1808.07645

  26. [26]

    Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    Xingyuan Hua, Sheng Yue, and Ju Ren. Learning to explore: Scaling agentic reasoning via exploration-aware policy optimization, 2026. URLhttps://arxiv.org/abs/2605.08978

  27. [27]

    Mysteries of mode collapse

    janus. Mysteries of mode collapse. https://www.lesswrong.com/posts/ t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse , November 2022. LessWrong post. Accessed: 2026-05-12

  28. [28]

    Reflect- then-plan: Offline model-based planning through a doubly bayesian lens

    Jihwan Jeong, Xiaoyu Wang, Jingmin Wang, Scott Sanner, and Pascal Poupart. Reflect- then-plan: Offline model-based planning through a doubly bayesian lens. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=pQYEMwHd6c

  29. [29]

    Search-r1: Training llms to reason and leverage search engines with reinforcement learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. InSecond Conference on Language Modeling

  30. [30]

    DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

    Hongbo Jin, Rongpeng Zhu, Zhongjing Du, Xu Jiang, Jingqi Tian, Qiaoman Zhang, and Jiayu Ding. Dgpo: Distribution guided policy optimization for fine grained credit assignment.arXiv preprint arXiv:2605.03327, 2026

  31. [31]

    Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

    Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

  32. [32]

    Vineppo: Refining credit assignment in rl training of llms, 2025

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms, 2025. URLhttps://arxiv.org/abs/2410.01679

  33. [33]

    Active task disambiguation with llms, 2025

    Katarzyna Kobalczyk, Nicolas Astorga, Tennison Liu, and Mihaela van der Schaar. Active task disambiguation with llms, 2025. URLhttps://arxiv.org/abs/2502.04485

  34. [34]

    In-context learning learns label relationships but is not conventional learning

    Jannik Kossen, Yarin Gal, and Tom Rainforth. In-context learning learns label relationships but is not conventional learning. InThe Twelfth International Conference on Learning Representations,

  35. [35]

    URLhttps://openreview.net/forum?id=YPIA7bgd5y

  36. [36]

    LLMs get lost in multi- turn conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs get lost in multi- turn conversation. InThe Fourteenth International Conference on Learning Representations,

  37. [37]

    URLhttps://openreview.net/forum?id=VKGTGGcwl6

  38. [38]

    PRInTS: Reward Modeling for Long-Horizon Information Seeking

    Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin, and Mohit Bansal. Prints: Reward modeling for long-horizon information seeking.arXiv preprint arXiv:2511.19314, 2025

  39. [39]

    Alfa: Aligning llms to ask good questions a case study in clinical reasoning

    Shuyue Stella Li, Jimin Mun, Faeze Brahman, Pedram Hosseini, Bryceton G Thomas, Jessica M Sin, Bing Ren, Jonathan S Ilgen, Yulia Tsvetkov, and Maarten Sap. Alfa: Aligning llms to ask good questions a case study in clinical reasoning. InSecond Conference on Language Modeling

  40. [40]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024. doi: 10.1162/tacl_a_00638. URLhttps://aclanthology.org/2024.tacl-1.9/. 12

  41. [41]

    Reason- ing models can be effective without thinking, 2025

    Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reason- ing models can be effective without thinking, 2025. URL https://arxiv.org/abs/2504. 09858

  42. [42]

    Bayesian decision problems and markov chains.(No Title), 1967

    James John Martin. Bayesian decision problems and markov chains.(No Title), 1967

  43. [43]

    Learning to ask informative questions: Enhancing LLMs with preference optimization and expected information gain

    Davide Mazzaccara, Alberto Testoni, and Raffaella Bernardi. Learning to ask informative questions: Enhancing LLMs with preference optimization and expected information gain. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5064–5074, Miami, Florida, USA, November

  44. [44]

    doi: 10.18653/v1/2024.findings-emnlp.291

    Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.291. URLhttps://aclanthology.org/2024.findings-emnlp.291/

  45. [45]

    Stepwise versus globally optimal search in children and adults.Cognition, 191:103965, 2019

    Björn Meder, Jonathan D Nelson, Matt Jones, and Azzurra Ruggeri. Stepwise versus globally optimal search in children and adults.Cognition, 191:103965, 2019

  46. [46]

    Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs

    Abhijnan Nath, Alireza Bagheri Garakani, Tianchen Zhou, Fan Yang, and Nikhil Krishnaswamy. Owen-shapley policy optimization (ospo): A principled rl algorithm for generative search llms. arXiv preprint arXiv:2601.08403, 2026

  47. [47]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads.arXiv preprint arXiv:2209.11895, 2022

  48. [48]

    Pearson Education India, 1999

    Alan V Oppenheim.Discrete-time signal processing. Pearson Education India, 1999

  49. [49]

    Performative thinking? the brittle correlation between cot length and problem complexity

    Vardhan Palod, Karthik Valmeekam, Kaya Stechly, and Subbarao Kambhampati. Performative thinking? the brittle correlation between cot length and problem complexity. InNeurIPS 2025 Workshop on Efficient Reasoning

  50. [50]

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty-second International Conference on Machine Learning, 2025

  51. [51]

    The threshold approach to clinical decision making

    Stephen G Pauker and Jerome P Kassirer. The threshold approach to clinical decision making. New England Journal of Medicine, 302(20):1109–1117, 1980

  52. [52]

    Information foraging.Psychological review, 106(4):643, 1999

    Peter Pirolli and Stuart Card. Information foraging.Psychological review, 106(4):643, 1999

  53. [53]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. InInternational conference on machine learning, pages 1278–1286. PMLR, 2014

  54. [54]

    Do people ask good questions? Computational Brain & Behavior, 1(1):69–89, 2018

    Anselm Rothe, Brenden M Lake, and Todd M Gureckis. Do people ask good questions? Computational Brain & Behavior, 1(1):69–89, 2018

  55. [55]

    Local coherence or global validity? investigating rlvr traces in math domains,

    Soumya Rani Samineni, Durgesh Kalwar, Vardaan Gangal, Siddhant Bhambri, and Subbarao Kambhampati. Local coherence or global validity? investigating rlvr traces in math domains,

  56. [56]

    URLhttps://arxiv.org/abs/2510.18176

  57. [57]

    Rl in name only? analyzing the structural assumptions in rl post-training for llms, 2026

    Soumya Rani Samineni, Durgesh Kalwar, Karthik Valmeekam, Kaya Stechly, and Subbarao Kambhampati. Rl in name only? analyzing the structural assumptions in rl post-training for llms, 2026. URLhttps://arxiv.org/abs/2505.13697

  58. [58]

    Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

  59. [59]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015. 13

  60. [60]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

  61. [61]

    Active learning literature survey

    Burr Settles. Active learning literature survey. 2009

  62. [62]

    Communication in the presence of noise.Proceedings of the IRE, 37(1): 10–21, 2006

    Claude E Shannon. Communication in the presence of noise.Proceedings of the IRE, 37(1): 10–21, 2006

  63. [63]

    Do llm agents know how to ground, recover, and assess? a benchmark for epistemic competence in information- seeking agents.arXiv preprint arXiv:2509.22391, 2025

    Jiaqi Shao, Yuxiang Lin, Munish Prasad Lohani, Yufeng Miao, and Bing Luo. Do llm agents know how to ground, recover, and assess? a benchmark for epistemic competence in information- seeking agents.arXiv preprint arXiv:2509.22391, 2025

  64. [64]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  65. [65]

    The optimal control of partially observable markov processes over a finite horizon.Operations research, 21(5):1071–1088, 1973

    Richard D Smallwood and Edward J Sondik. The optimal control of partially observable markov processes over a finite horizon.Operations research, 21(5):1071–1088, 1973

  66. [66]

    Large language model reasoning failures

    Peiyang Song, Pengrui Han, and Noah Goodman. Large language model reasoning failures. arXiv preprint arXiv:2602.06176, 2026

  67. [67]

    Policy learning with a language bottleneck.arXiv preprint arXiv:2405.04118, 2024

    Megha Srivastava, Cédric Colas, Dorsa Sadigh, and Jacob Andreas. Policy learning with a language bottleneck.arXiv preprint arXiv:2405.04118, 2024

  68. [68]

    MIT press, 2018

    Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018

  69. [69]

    University of Massachusetts Amherst, 1984

    Richard Stuart Sutton.Temporal credit assignment in reinforcement learning. University of Massachusetts Amherst, 1984

  70. [70]

    Williams

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist re- inforcement learning.Mach. Learn., 8(3–4):229–256, may 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URLhttps://doi.org/10.1007/BF00992696

  71. [71]

    Analyzing and mitigating interference in neural architecture search, 2021

    Jin Xu, Xu Tan, Kaitao Song, Renqian Luo, Yichong Leng, Tao Qin, Tie-Yan Liu, and Jian Li. Analyzing and mitigating interference in neural architecture search, 2021

  72. [72]

    Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  73. [73]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  74. [74]

    Tomz, Christopher D

    Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R Tomz, Christopher D Manning, and Weiyan Shi. Verbalized sampling: How to mitigate mode collapse and unlock llm diversity.arXiv preprint arXiv:2510.01171, 2025

  75. [75]

    Beyond markovian: Reflective exploration via bayes-adaptive rl for llm reasoning, 2025

    Shenao Zhang, Yaqing Wang, Yinxiao Liu, Tianqi Liu, Peter Grabowski, Eugene Ie, Zhaoran Wang, and Yunxuan Li. Beyond markovian: Reflective exploration via bayes-adaptive rl for llm reasoning, 2025. URLhttps://arxiv.org/abs/2505.20561

  76. [76]

    BELLE: A bi-level multi-agent reasoning framework for multi-hop question answering

    Taolin Zhang, Dongyang Li, Qizhou Chen, Chengyu Wang, and Xiaofeng He. BELLE: A bi-level multi-agent reasoning framework for multi-hop question answering. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

  77. [77]

    Probing the multi-turn planning capabili- ties of LLMs via 20 question games

    Yizhe Zhang, Jiarui Lu, and Navdeep Jaitly. Probing the multi-turn planning capabili- ties of LLMs via 20 question games. In Lun-Wei Ku, Andre Martins, and Vivek Sriku- mar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 1495–1516, Bangkok, Thailand, August

  78. [78]

    doi: 10.18653/v1/2024.acl-long.82

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.82. URL https://aclanthology.org/2024.acl-long.82/

  79. [79]

    The entity-deduction arena: A playground for probing the conversational reasoning and planning capabilities of LLMs, 2024

    Yizhe Zhang, Jiarui Lu, and Navdeep Jaitly. The entity-deduction arena: A playground for probing the conversational reasoning and planning capabilities of LLMs, 2024. URL https://openreview.net/forum?id=PfrpYGKGPL

  80. [80]

    Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning

    Tiancheng Zhao and Maxine Eskenazi. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. InProceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 1–10, 2016

Showing first 80 references.