pith. machine review for the scientific record.

arxiv: 2605.07042 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.LG

Recognition: 1 theorem link

· Lean Theorem

The Context Gathering Decision Process: A POMDP Framework for Agentic Search

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 01:07 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM agents · POMDP · context gathering · belief state · multi-hop reasoning · agentic search · exhaustion detection · Thompson sampling
0 comments

The pith

Modeling LLM agent search as a POMDP lets explicit belief states replace lossy implicit memory and improve multi-hop reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames the iterative exploration LLM agents perform in large environments as the Context Gathering Decision Process, a specialized POMDP whose goal is to refine a belief state until the task-relevant information is isolated. Current agents suffer degraded working memory because their implicit state inside the context window loses track of what has been searched, producing loops and early stops. The authors decompose this implicit behavior into explicit predicate operations that produce a persistent, bounded belief state and a separate programmatic check for search exhaustion. Replacing the LLM's native state with the CGDP-derived belief state, and adding the exhaustion gate, yields measurable gains on question-answering tasks. A reader cares because the interventions are plug-and-play and do not require retraining the underlying model.

Core claim

The Context Gathering Decision Process (CGDP) is a POMDP in which an agent adaptively gathers and refines a belief state over an environment whose size exceeds any single context window. An LLM's search is modeled as approximate Thompson sampling inside this CGDP; the implicit behavior is then decomposed into modular predicate operations. This decomposition supplies two interventions: a predicate-based belief state that remains persistent across turns while bounding context length, and a programmatic exhaustion gate that stops search only when the predicates indicate no further progress is possible.
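Read operationally, the core claim describes a gather-refine-stop loop. The sketch below is an illustrative reading, not the paper's implementation: `cgdp_loop`, the dictionary environment, and the random choice standing in for Thompson sampling are all assumptions of this example.

```python
import random

def cgdp_loop(queries, environment, max_turns=10):
    """Refine an explicit predicate belief state by iterative context gathering.

    A minimal stand-in for the CGDP loop: the belief state is a persistent set
    of (query, observation) predicates, and search halts only when no untried
    query remains (a crude exhaustion gate).
    """
    belief = set()   # persistent, bounded belief state
    tried = set()    # explicit record of issued queries (prevents loops)
    for _ in range(max_turns):
        remaining = [q for q in queries if q not in tried]
        if not remaining:          # exhaustion gate: nothing left to try
            break
        query = random.choice(remaining)   # stand-in for Thompson sampling
        tried.add(query)
        observation = environment.get(query)
        if observation is not None:
            belief.add((query, observation))   # record an explicit predicate
    return belief

# toy environment: two useful facts plus one dead-end query
env = {"capital_of_france": "Paris", "river_in_paris": "Seine"}
result = cgdp_loop(list(env) + ["unrelated"], env)
```

With `max_turns` large enough to cover every candidate query, the loop terminates via the gate rather than the turn budget, and the belief state holds exactly the two recoverable facts.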

What carries the argument

The predicate-based belief state, which encodes gathered context as explicit logical predicates so that multi-hop dependencies remain visible without exhausting the context window.
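One way to picture that encoding: store gathered facts as explicit (relation, subject, object) triples so a multi-hop chain stays traversable without keeping the full transcript in context. The class and relation names below are invented for illustration and are not the paper's API.

```python
class PredicateBeliefState:
    """Compact belief state: a set of (relation, subject, object) triples."""

    def __init__(self):
        self.predicates = set()

    def add(self, relation, subject, obj):
        self.predicates.add((relation, subject, obj))

    def hop(self, relation, subject):
        """Follow one hop: the object linked to subject by relation, if any."""
        for r, s, o in self.predicates:
            if r == relation and s == subject:
                return o
        return None

b = PredicateBeliefState()
b.add("directed", "Nolan", "Inception")
b.add("released_in", "Inception", "2010")
# two-hop question: when was the film directed by Nolan released?
film = b.hop("directed", "Nolan")
year = b.hop("released_in", film)
```

The second hop depends on the first hop's answer, which is exactly the dependency an implicit, lossy transcript tends to drop.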

If this is right

  • Persistent predicate belief states improve multi-hop reasoning accuracy by up to 11.4% across four methods and three domains.
  • Programmatic exhaustion detection reduces token consumption by up to 39% while leaving task performance unchanged.
  • The CGDP framing produces modular, non-interfering additions that can be added to existing agent harnesses.
  • Explicit state tracking prevents both redundant loops and premature stopping in environments larger than the model's context window.
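A minimal sketch of what a programmatic exhaustion gate might check, under the assumption that "no further progress" means no un-issued query could add a new predicate. The paper's actual gate may use a different criterion; this is only an illustration of the shape of the check.

```python
def is_exhausted(candidate_queries, issued, belief, answers):
    """True when every candidate query has been issued or can add nothing new.

    answers maps a query to the predicate it would contribute; a query is
    "productive" if it is un-issued and its answer is not yet in the belief.
    """
    for q in candidate_queries:
        if q in issued:
            continue
        if q in answers and answers[q] not in belief:
            return False   # a productive query remains: keep searching
    return True

answers = {"q1": "a1", "q2": "a2"}
# q2 is still un-issued and would add a new fact -> not exhausted
ongoing = is_exhausted(["q1", "q2"], {"q1"}, {"a1"}, answers)
# everything has been issued -> exhausted, halt without looping
done = is_exhausted(["q1", "q2"], {"q1", "q2"}, {"a1", "a2"}, answers)
```

Because the gate is a deterministic predicate check rather than an LLM judgment, it cannot stop early out of "impatience" and cannot loop on already-issued queries.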

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same predicate decomposition could be applied to non-LLM agents that also maintain internal search state.
  • The exhaustion gate might be combined with external signals such as tool-call success rates to further reduce wasted computation.
  • If predicates are learned rather than hand-specified, the approach could transfer to new domains without manual engineering.
  • The CGDP view suggests similar POMDP decompositions for other agent loops such as long-horizon planning or tool-use sequences.

Load-bearing premise

LLM implicit search behavior can be decomposed into explicit predicate operations that preserve the original decision distribution without adding bias or dropping information.

What would settle it

Run the same agent with and without the CGDP belief state on a new multi-hop QA benchmark and measure whether accuracy falls or token count rises when the predicate decomposition is used.
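That head-to-head test could be scored with a harness like the following; `run_agent` and the stub behavior are placeholders for the real agent under both configurations, not the paper's evaluation code.

```python
def compare(run_agent, questions, use_belief_state):
    """Return (accuracy, total tokens) for one agent configuration."""
    correct = tokens = 0
    for q in questions:
        ok, t = run_agent(q, use_belief_state=use_belief_state)
        correct += ok
        tokens += t
    return correct / len(questions), tokens

# stub agent for illustration: the belief state helps accuracy and
# the tighter state keeps per-question token cost lower
def stub_agent(q, use_belief_state):
    return (True, 80) if use_belief_state else (q % 2 == 0, 120)

acc_on, tok_on = compare(stub_agent, list(range(10)), True)
acc_off, tok_off = compare(stub_agent, list(range(10)), False)
```

The decisive signal would be the paper's pattern holding on an unseen benchmark: accuracy no worse, token count no higher, with the belief state enabled.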

Figures

Figures reproduced from arXiv: 2605.07042 by Adith Swaminathan, Chinmaya Kausik, Nathan Kallus.

Figure 1. The PBAI loop with the paper's two harness interventions.
Figure 2. Pareto frontiers of quality vs. token efficiency across harnesses; lobotomization shown in orange.
read the original abstract

Large Language Model (LLM) agents are deployed in complex environments -- such as massive codebases, enterprise databases, and conversational histories -- where the relevant state far exceeds their context windows. To navigate these spaces, an agent must iteratively explore the environment to find relevant information. However, without explicit infrastructure, an agent's working memory can degrade into lossy representations of the search state, resulting in redundant work (e.g. repetitive looping) and premature stopping. In this work, we formalize this challenge as the Context Gathering Decision Process (CGDP), a specialized Partially Observable Markov Decision Process, where an agent's objective is to adaptively refine its belief state to isolate the necessary information for a task. We model an LLM's behavior as approximate Thompson Sampling within this CGDP, and introduce a predicate-based method that decomposes an LLM's implicit search into explicit and modular operations. We then derive two plug-and-play interventions for iterative LLM agents: a persistent, predicate-based belief state that bounds context while preserving multi-hop reasoning, and a programmatic exhaustion gate that halts unproductive search without premature stopping. Across four methods and three question-answering domains, we empirically validate that replacing an LLM's implicit state with our CGDP-motivated belief state improves multi-hop reasoning by up to $11.4\%$; while the modular programmatic exhaustion detection saves up to $39\%$ of tokens without any degradation in agent performance. Ultimately, we argue that framing the LLM agent loop as a CGDP can guide the design of modular, non-interfering improvements to agentic search harnesses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the Context Gathering Decision Process (CGDP) as a specialized POMDP for LLM agents navigating environments larger than their context windows. It models LLM iterative search as approximate Thompson Sampling, derives a predicate-based decomposition of implicit search into explicit modular operations, and proposes two plug-and-play interventions: a persistent predicate-based belief state and a programmatic exhaustion gate. Empirical evaluation across four methods and three QA domains reports up to 11.4% gains in multi-hop reasoning and up to 39% token savings without performance loss.

Significance. If the results hold, the work supplies a formal POMDP lens that can systematically guide modular improvements to agentic search harnesses, addressing common failure modes such as redundant looping and premature stopping. The broad evaluation across multiple baselines and domains, together with the parameter-free character of the derived interventions, gives the claims concrete practical weight for LLM agent design.

major comments (1)
  1. [§4] §4 (Predicate-based belief state and exhaustion gate): The central performance claims require that the explicit predicate operations and exhaustion gate preserve the original LLM decision distribution. No direct measurement or ablation is reported that compares the next-query or stop decisions produced by the predicate belief state against those of the baseline LLM on identical intermediate histories. Without this check, the reported 11.4% and 39% gains could partly reflect prompting-format changes rather than faithful CGDP interventions.
minor comments (2)
  1. [§3] The POMDP tuple (states, actions, transition, reward, observation, observation function) for the CGDP is described at a high level; spelling out the concrete definitions of the observation and belief-update operators would improve traceability from model to implementation.
  2. [§5] Table 1 (or equivalent results table) reports aggregate percentages; adding per-domain and per-method breakdowns with confidence intervals would make the cross-domain claim easier to assess.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about directly verifying that the predicate-based interventions preserve the LLM's original decision distribution is well-taken, and we will strengthen the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Predicate-based belief state and exhaustion gate): The central performance claims require that the explicit predicate operations and exhaustion gate preserve the original LLM decision distribution. No direct measurement or ablation is reported that compares the next-query or stop decisions produced by the predicate belief state against those of the baseline LLM on identical intermediate histories. Without this check, the reported 11.4% and 39% gains could partly reflect prompting-format changes rather than faithful CGDP interventions.

    Authors: We agree that a direct comparison of next-query and stop decisions on matched histories would provide stronger evidence that gains derive from the CGDP structure rather than format artifacts. The predicate belief state is populated by the same LLM used in the baseline (via explicit extraction of predicates from its outputs), and the exhaustion gate is a deterministic check on predicate coverage; neither alters the underlying LLM query-generation prompt. Nevertheless, the current manuscript does not report the requested side-by-side ablation. In the revision we will add a targeted experiment: for a random sample of intermediate trajectories we will (i) render the current predicate belief as a concise natural-language summary, (ii) prompt the baseline LLM with that summary plus the original task, and (iii) compare its next-query and stop decisions to those produced by the explicit CGDP policy on the identical predicate state. Any divergence will be quantified and discussed; we expect the comparison to confirm that the reported improvements stem from persistent, modular state representation rather than prompting differences. revision: yes
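The side-by-side ablation promised in the response above could be scored as plain decision agreement on matched histories. The two policy callables here are stand-ins for the baseline LLM and the explicit CGDP policy; only the scoring shape is being illustrated.

```python
def decision_agreement(histories, baseline_decide, cgdp_decide):
    """Fraction of matched histories where the two policies make the same
    (next-query or stop) decision on the identical predicate state."""
    agree = sum(baseline_decide(h) == cgdp_decide(h) for h in histories)
    return agree / len(histories)

histories = ["h1", "h2", "h3", "h4"]
baseline = {"h1": "stop", "h2": "query_A", "h3": "stop", "h4": "query_B"}
cgdp = {"h1": "stop", "h2": "query_A", "h3": "query_C", "h4": "query_B"}
rate = decision_agreement(histories, baseline.get, cgdp.get)
```

A rate near 1.0 would support the authors' claim that the gains come from persistent state rather than prompting-format differences; a low rate would point to the referee's confound.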

Circularity Check

0 steps flagged

Empirical results independent of modeling assumptions

full rationale

The paper defines the CGDP as a POMDP, states the approximate Thompson Sampling model as an assumption, and designs predicate-based belief state plus exhaustion gate as interventions derived from that framework. The headline performance claims (11.4% multi-hop gain, 39% token savings) are measured directly in new experiments on standard QA tasks rather than obtained by fitting parameters to the same data or by reducing to prior self-citations. No equation or derivation step equates a reported outcome to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claims rest on treating LLM search as approximate Thompson sampling inside a custom POMDP and on the claim that predicates can decompose that search without loss; no numerical free parameters are fitted in the reported results.

axioms (1)
  • domain assumption LLM agent behavior in context-gathering tasks can be modeled as approximate Thompson sampling within a POMDP whose belief state tracks missing information.
    Explicitly stated as the modeling choice that enables the subsequent predicate decomposition and interventions.
invented entities (3)
  • Context Gathering Decision Process (CGDP) no independent evidence
    purpose: Specialized POMDP that captures iterative refinement of an agent's belief about relevant information in oversized environments.
    Newly defined in the paper as the core formal object.
  • predicate-based belief state no independent evidence
    purpose: Persistent, bounded representation that replaces the LLM's implicit working memory while preserving multi-hop reasoning.
    Derived directly from the CGDP model and introduced as a plug-and-play component.
  • programmatic exhaustion gate no independent evidence
    purpose: Modular check that halts search when further exploration is unproductive without causing premature stopping.
    Second intervention derived from the CGDP framing.
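The ledger's single axiom, LLM search as approximate Thompson sampling, can be illustrated with Beta posteriors over each action's usefulness: sample one estimate per action, act on the argmax. The priors, counts, and action names below are assumptions of this sketch, not values from the paper.

```python
import random

def thompson_pick(successes, failures, rng):
    """Sample a usefulness estimate per action from its Beta posterior
    and return the action with the highest sampled value."""
    samples = {a: rng.betavariate(successes[a] + 1, failures[a] + 1)
               for a in successes}
    return max(samples, key=samples.get)

rng = random.Random(0)  # seeded for reproducibility
# one action has mostly succeeded, the other mostly failed
successes = {"search_wiki": 9, "search_code": 1}
failures = {"search_wiki": 1, "search_code": 9}
picks = [thompson_pick(successes, failures, rng) for _ in range(100)]
```

Over repeated draws the posterior favoring `search_wiki` dominates, while the occasional `search_code` pick preserves exploration, which is the behavior the axiom attributes to the LLM's implicit search.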

pith-pipeline@v0.9.0 · 5591 in / 1698 out tokens · 53946 ms · 2026-05-11T01:07:55.660635+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1]

    FAIR-RAG: Faithful adaptive iterative refinement for retrieval-augmented generation.arXiv preprint arXiv:2510.22344, 2025

    Mohammad Aghajani Asl, Majid Asgari-Bidhendi, and Behrooz Minaei-Bidgoli. FAIR-RAG: Faithful adaptive iterative refinement for retrieval-augmented generation.arXiv preprint arXiv:2510.22344, 2025

  2. [2]

    Self-RAG: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. InICLR, 2024

  3. [3]

    Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

  4. [4]

    A close look into the calibration of pre-trained language models

    Yangyi Chen, Lifan Yuan, Ganqu Cui, Zhiyuan Liu, and Heng Ji. A close look into the calibration of pre-trained language models. InACL, pages 1343–1367, 2023

  5. [5]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2025

  6. [6]

    Don’t hallucinate, abstain: Identifying LLM knowledge gaps via multi-LLM collaboration.ACL, 2024

    Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don’t hallucinate, abstain: Identifying LLM knowledge gaps via multi-LLM collaboration. ACL, 2024

  7. [7]

    Optimal best arm identification with fixed confidence

    Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. InCOLT, pages 998–1027, 2016

  8. [8]

    HippoRAG: Neurobiologically inspired long-term memory for large language models

    Bernal Jimenez Gutierrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. InNeurIPS, 2024

  9. [9]

    Large language models cannot self-correct reasoning yet

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. InICLR, 2024

  10. [10]

    Active retrieval augmented generation

    Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. InEMNLP, pages 7969–7992, 2023

  11. [11]

    Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan O. Arik. Long-context LLMs meet RAG: Overcoming challenges for long inputs in RAG. InICLR, 2025

  12. [12]

    Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. InCOLM, 2025

  13. [13]

    Planning and acting in partially observable stochastic domains

    Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998

  14. [14]

    LLMs get lost in multi- turn conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs get lost in multi- turn conversation. InICLR, 2026

  15. [15]

    Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

    Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic aggregation for parallel scaling of long-horizon agentic tasks.arXiv preprint arXiv:2604.11753, 2026

  16. [16]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InNeurIPS, 2020

  17. [17]

    LooGLE: Can long-context language models understand long contexts? InACL, pages 16304–16333, 2024

    Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. LooGLE: Can long-context language models understand long contexts? InACL, pages 16304–16333, 2024

  18. [18]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. InEMNLP, pages 5420–5438, 2025

  19. [19]

    WebThinker: Empowering large reasoning models with deep research capability

    Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. WebThinker: Empowering large reasoning models with deep research capability. InNeurIPS, 2026

  20. [20]

    HopRAG: Multi-hop reasoning for logic-aware retrieval-augmented generation

    Hao Liu, Zhengren Wang, Xi Chen, Zhiyu Li, Feiyu Xiong, Qinhan Yu, and Wentao Zhang. HopRAG: Multi-hop reasoning for logic-aware retrieval-augmented generation. In ACL Findings, pages 1897–1913, 2025

  21. [21]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  22. [22]

    Evaluating very long-term conversational memory of LLM agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. InACL, pages 13851–13870, 2024

  23. [23]

    Solving a million-step LLM task with zero errors.arXiv preprint arXiv:2511.09030, 2025

    Elliot Meyerson, Giuseppe Paolo, Roberto Dailey, Hormoz Shahrzad, Olivier Francon, Conor F Hayes, Xin Qiu, Babak Hodjat, and Risto Miikkulainen. Solving a million-step LLM task with zero errors.arXiv preprint arXiv:2511.09030, 2025

  24. [24]

    Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation

    Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. ICLR, 2024

  25. [25]

    (more) efficient reinforcement learning via posterior sampling

    Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. InNeurIPS, 2013

  26. [26]

    MemGPT: Towards LLMs as operating systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. MemGPT: Towards LLMs as operating systems. InICLR, 2024

  27. [27]

    Generative agents: Interactive simulacra of human behavior.UIST, 2023

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior.UIST, 2023

  28. [28]

    SWE-QA: Can Language Models Answer Repository-level Code Questions?

    Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. SWE-QA: Can language models answer repository-level code questions?arXiv preprint arXiv:2509.14635, 2025

  29. [29]

    StateAct: Enhancing LLM base agents via self-prompting and state-tracking

    Nikolai Rozanov and Marek Rei. StateAct: Enhancing LLM base agents via self-prompting and state-tracking. InREALM, pages 367–385, 2025

  30. [30]

    Simple bayesian algorithms for best arm identification

    Daniel Russo. Simple Bayesian algorithms for best arm identification. In COLT, pages 1417–1418, 2016

  31. [31]

    A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 2018

    Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 2018

  32. [32]

    Lost in transmission: When and why LLMs fail to reason globally

    Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, and Jennifer Neville. Lost in transmission: When and why LLMs fail to reason globally. In NeurIPS, 2025

  33. [33]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. InEMNLP Findings, pages 9248–9274, 2023

  34. [34]

    DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models

    Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models. InACL, pages 12991–13013, 2024

  35. [35]

    On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25:285–294, 1933

    William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25:285–294, 1933

  36. [36]

    MuSiQue: Multihop questions via single hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  37. [37]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. InACL, pages 10014–10037, 2023

  38. [38]

    Extending context window of large language models from a distributional perspective

    Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, and Bing Qin. Extending context window of large language models from a distributional perspective. InEMNLP, pages 7288–7301, 2024

  39. [39]

    A-Mem: Agentic memory for LLM agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-Mem: Agentic memory for LLM agents. InNeurIPS, 2025

  40. [40]

    Corrective Retrieval Augmented Generation

    Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation.arXiv preprint arXiv:2401.15884, 2024

  41. [41]

    SWE-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InNeurIPS, pages 50528–50652, 2024

  42. [42]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InICLR, 2023

  43. [43]

    Lost in the maze: Overcoming context limitations in long-horizon agentic search.arXiv preprint arXiv:2510.18939, 2025

    Howard Yen, Ashwin Paranjape, Mengzhou Xia, Thejas Venkatesh, Jack Hessel, Danqi Chen, and Yuhao Zhang. Lost in the maze: Overcoming context limitations in long-horizon agentic search.arXiv preprint arXiv:2510.18939, 2025

  44. [44]

    DeepResearcher: Scaling deep research via reinforcement learning in real-world environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In EMNLP, pages 414–431, 2025


    Aggressive smooth( U=0.10–0.15, p=2): high fire rates, more negatives on non-IRCoT methods but larger gains on IRCoT. Three pairs of configurations are functionally identical (L1 = 0pp): thefull vs.query_and_full trigger mode makes no difference when the diff threshold is not binding. The global config is near-optimal: L1 distance between the global and p...