pith. machine review for the scientific record.

arxiv: 2605.07042 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.LG

Recognition: 1 theorem link

· Lean Theorem

The Context Gathering Decision Process: A POMDP Framework for Agentic Search

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 01:07 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM agents · POMDP · context gathering · belief state · multi-hop reasoning · agentic search · exhaustion detection · Thompson sampling
0 comments

The pith

Modeling LLM agent search as a POMDP lets explicit belief states replace lossy implicit memory and improve multi-hop reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames the iterative exploration LLM agents perform in large environments as the Context Gathering Decision Process, a specialized POMDP whose goal is to refine a belief state until the task-relevant information is isolated. Current agents suffer degraded working memory because their implicit state inside the context window loses track of what has been searched, producing loops and early stops. The authors decompose this implicit behavior into explicit predicate operations that produce a persistent, bounded belief state and a separate programmatic check for search exhaustion. Replacing the LLM's native state with the CGDP-derived belief state, and adding the exhaustion gate, yields measurable gains on question-answering tasks. A reader cares because the interventions are plug-and-play and do not require retraining the underlying model.

Core claim

The Context Gathering Decision Process (CGDP) is a POMDP in which an agent adaptively gathers and refines a belief state over an environment whose size exceeds any single context window. An LLM's search is modeled as approximate Thompson sampling inside this CGDP; the implicit behavior is then decomposed into modular predicate operations. This decomposition supplies two interventions: a predicate-based belief state that remains persistent across turns while bounding context length, and a programmatic exhaustion gate that stops search only when the predicates indicate no further progress is possible.
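Read operationally, the core claim describes a gather-refine-stop loop. The sketch below is an illustrative reading, not the paper's implementation: `cgdp_loop`, the dictionary environment, and the random choice standing in for Thompson sampling are all assumptions of this example.

```python
import random

def cgdp_loop(queries, environment, max_turns=10):
    """Refine an explicit predicate belief state by iterative context gathering.

    A minimal stand-in for the CGDP loop: the belief state is a persistent set
    of (query, observation) predicates, and search halts only when no untried
    query remains (a crude exhaustion gate).
    """
    belief = set()   # persistent, bounded belief state
    tried = set()    # explicit record of issued queries (prevents loops)
    for _ in range(max_turns):
        remaining = [q for q in queries if q not in tried]
        if not remaining:          # exhaustion gate: nothing left to try
            break
        query = random.choice(remaining)   # stand-in for Thompson sampling
        tried.add(query)
        observation = environment.get(query)
        if observation is not None:
            belief.add((query, observation))   # record an explicit predicate
    return belief

# toy environment: two useful facts plus one dead-end query
env = {"capital_of_france": "Paris", "river_in_paris": "Seine"}
result = cgdp_loop(list(env) + ["unrelated"], env)
```

With `max_turns` large enough to cover every candidate query, the loop terminates via the gate rather than the turn budget, and the belief state holds exactly the two recoverable facts.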

What carries the argument

The predicate-based belief state, which encodes gathered context as explicit logical predicates so that multi-hop dependencies remain visible without exhausting the context window.
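One way to picture that encoding: store gathered facts as explicit (relation, subject, object) triples so a multi-hop chain stays traversable without keeping the full transcript in context. The class and relation names below are invented for illustration and are not the paper's API.

```python
class PredicateBeliefState:
    """Compact belief state: a set of (relation, subject, object) triples."""

    def __init__(self):
        self.predicates = set()

    def add(self, relation, subject, obj):
        self.predicates.add((relation, subject, obj))

    def hop(self, relation, subject):
        """Follow one hop: the object linked to subject by relation, if any."""
        for r, s, o in self.predicates:
            if r == relation and s == subject:
                return o
        return None

b = PredicateBeliefState()
b.add("directed", "Nolan", "Inception")
b.add("released_in", "Inception", "2010")
# two-hop question: when was the film directed by Nolan released?
film = b.hop("directed", "Nolan")
year = b.hop("released_in", film)
```

The second hop depends on the first hop's answer, which is exactly the dependency an implicit, lossy transcript tends to drop.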

If this is right

  • Persistent predicate belief states improve multi-hop reasoning accuracy by up to 11.4% across four methods and three domains.
  • Programmatic exhaustion detection reduces token consumption by up to 39% while leaving task performance unchanged.
  • The CGDP framing produces modular, non-interfering additions that can be added to existing agent harnesses.
  • Explicit state tracking prevents both redundant loops and premature stopping in environments larger than the model's context window.
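A minimal sketch of what a programmatic exhaustion gate might check, under the assumption that "no further progress" means no un-issued query could add a new predicate. The paper's actual gate may use a different criterion; this is only an illustration of the shape of the check.

```python
def is_exhausted(candidate_queries, issued, belief, answers):
    """True when every candidate query has been issued or can add nothing new.

    answers maps a query to the predicate it would contribute; a query is
    "productive" if it is un-issued and its answer is not yet in the belief.
    """
    for q in candidate_queries:
        if q in issued:
            continue
        if q in answers and answers[q] not in belief:
            return False   # a productive query remains: keep searching
    return True

answers = {"q1": "a1", "q2": "a2"}
# q2 is still un-issued and would add a new fact -> not exhausted
ongoing = is_exhausted(["q1", "q2"], {"q1"}, {"a1"}, answers)
# everything has been issued -> exhausted, halt without looping
done = is_exhausted(["q1", "q2"], {"q1", "q2"}, {"a1", "a2"}, answers)
```

Because the gate is a deterministic predicate check rather than an LLM judgment, it cannot stop early out of "impatience" and cannot loop on already-issued queries.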

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same predicate decomposition could be applied to non-LLM agents that also maintain internal search state.
  • The exhaustion gate might be combined with external signals such as tool-call success rates to further reduce wasted computation.
  • If predicates are learned rather than hand-specified, the approach could transfer to new domains without manual engineering.
  • The CGDP view suggests similar POMDP decompositions for other agent loops such as long-horizon planning or tool-use sequences.

Load-bearing premise

LLM implicit search behavior can be decomposed into explicit predicate operations that preserve the original decision distribution without adding bias or dropping information.

What would settle it

Run the same agent with and without the CGDP belief state on a new multi-hop QA benchmark and measure whether accuracy falls or token count rises when the predicate decomposition is used.
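That head-to-head test could be scored with a harness like the following; `run_agent` and the stub behavior are placeholders for the real agent under both configurations, not the paper's evaluation code.

```python
def compare(run_agent, questions, use_belief_state):
    """Return (accuracy, total tokens) for one agent configuration."""
    correct = tokens = 0
    for q in questions:
        ok, t = run_agent(q, use_belief_state=use_belief_state)
        correct += ok
        tokens += t
    return correct / len(questions), tokens

# stub agent for illustration: the belief state helps accuracy and
# the tighter state keeps per-question token cost lower
def stub_agent(q, use_belief_state):
    return (True, 80) if use_belief_state else (q % 2 == 0, 120)

acc_on, tok_on = compare(stub_agent, list(range(10)), True)
acc_off, tok_off = compare(stub_agent, list(range(10)), False)
```

The decisive signal would be the paper's pattern holding on an unseen benchmark: accuracy no worse, token count no higher, with the belief state enabled.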

Figures

Figures reproduced from arXiv: 2605.07042 by Adith Swaminathan, Chinmaya Kausik, Nathan Kallus.

Figure 1. The PBAI loop with the paper's two harness interventions.
Figure 2. Pareto frontiers of quality vs. token efficiency across harnesses; lobotomization shown in orange.
read the original abstract

Large Language Model (LLM) agents are deployed in complex environments -- such as massive codebases, enterprise databases, and conversational histories -- where the relevant state far exceeds their context windows. To navigate these spaces, an agent must iteratively explore the environment to find relevant information. However, without explicit infrastructure, an agent's working memory can degrade into lossy representations of the search state, resulting in redundant work (e.g. repetitive looping) and premature stopping. In this work, we formalize this challenge as the Context Gathering Decision Process (CGDP), a specialized Partially Observable Markov Decision Process, where an agent's objective is to adaptively refine its belief state to isolate the necessary information for a task. We model an LLM's behavior as approximate Thompson Sampling within this CGDP, and introduce a predicate-based method that decomposes an LLM's implicit search into explicit and modular operations. We then derive two plug-and-play interventions for iterative LLM agents: a persistent, predicate-based belief state that bounds context while preserving multi-hop reasoning, and a programmatic exhaustion gate that halts unproductive search without premature stopping. Across four methods and three question-answering domains, we empirically validate that replacing an LLM's implicit state with our CGDP-motivated belief state improves multi-hop reasoning by up to $11.4\%$; while the modular programmatic exhaustion detection saves up to $39\%$ of tokens without any degradation in agent performance. Ultimately, we argue that framing the LLM agent loop as a CGDP can guide the design of modular, non-interfering improvements to agentic search harnesses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the Context Gathering Decision Process (CGDP) as a specialized POMDP for LLM agents navigating environments larger than their context windows. It models LLM iterative search as approximate Thompson Sampling, derives a predicate-based decomposition of implicit search into explicit modular operations, and proposes two plug-and-play interventions: a persistent predicate-based belief state and a programmatic exhaustion gate. Empirical evaluation across four methods and three QA domains reports up to 11.4% gains in multi-hop reasoning and up to 39% token savings without performance loss.

Significance. If the results hold, the work supplies a formal POMDP lens that can systematically guide modular improvements to agentic search harnesses, addressing common failure modes such as redundant looping and premature stopping. The broad evaluation across multiple baselines and domains, together with the parameter-free character of the derived interventions, gives the claims concrete practical weight for LLM agent design.

major comments (1)
  1. [§4] §4 (Predicate-based belief state and exhaustion gate): The central performance claims require that the explicit predicate operations and exhaustion gate preserve the original LLM decision distribution. No direct measurement or ablation is reported that compares the next-query or stop decisions produced by the predicate belief state against those of the baseline LLM on identical intermediate histories. Without this check, the reported 11.4% and 39% gains could partly reflect prompting-format changes rather than faithful CGDP interventions.
minor comments (2)
  1. [§3] The POMDP tuple (states, actions, transition, reward, observation, observation function) for the CGDP is described at a high level; spelling out the concrete definitions of the observation and belief-update operators would improve traceability from model to implementation.
  2. [§5] Table 1 (or equivalent results table) reports aggregate percentages; adding per-domain and per-method breakdowns with confidence intervals would make the cross-domain claim easier to assess.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about directly verifying that the predicate-based interventions preserve the LLM's original decision distribution is well-taken, and we will strengthen the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Predicate-based belief state and exhaustion gate): The central performance claims require that the explicit predicate operations and exhaustion gate preserve the original LLM decision distribution. No direct measurement or ablation is reported that compares the next-query or stop decisions produced by the predicate belief state against those of the baseline LLM on identical intermediate histories. Without this check, the reported 11.4% and 39% gains could partly reflect prompting-format changes rather than faithful CGDP interventions.

    Authors: We agree that a direct comparison of next-query and stop decisions on matched histories would provide stronger evidence that gains derive from the CGDP structure rather than format artifacts. The predicate belief state is populated by the same LLM used in the baseline (via explicit extraction of predicates from its outputs), and the exhaustion gate is a deterministic check on predicate coverage; neither alters the underlying LLM query-generation prompt. Nevertheless, the current manuscript does not report the requested side-by-side ablation. In the revision we will add a targeted experiment: for a random sample of intermediate trajectories we will (i) render the current predicate belief as a concise natural-language summary, (ii) prompt the baseline LLM with that summary plus the original task, and (iii) compare its next-query and stop decisions to those produced by the explicit CGDP policy on the identical predicate state. Any divergence will be quantified and discussed; we expect the comparison to confirm that the reported improvements stem from persistent, modular state representation rather than prompting differences. revision: yes
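The side-by-side ablation promised in the response above could be scored as plain decision agreement on matched histories. The two policy callables here are stand-ins for the baseline LLM and the explicit CGDP policy; only the scoring shape is being illustrated.

```python
def decision_agreement(histories, baseline_decide, cgdp_decide):
    """Fraction of matched histories where the two policies make the same
    (next-query or stop) decision on the identical predicate state."""
    agree = sum(baseline_decide(h) == cgdp_decide(h) for h in histories)
    return agree / len(histories)

histories = ["h1", "h2", "h3", "h4"]
baseline = {"h1": "stop", "h2": "query_A", "h3": "stop", "h4": "query_B"}
cgdp = {"h1": "stop", "h2": "query_A", "h3": "query_C", "h4": "query_B"}
rate = decision_agreement(histories, baseline.get, cgdp.get)
```

A rate near 1.0 would support the authors' claim that the gains come from persistent state rather than prompting-format differences; a low rate would point to the referee's confound.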

Circularity Check

0 steps flagged

Empirical results independent of modeling assumptions

full rationale

The paper defines the CGDP as a POMDP, states the approximate Thompson Sampling model as an assumption, and designs predicate-based belief state plus exhaustion gate as interventions derived from that framework. The headline performance claims (11.4% multi-hop gain, 39% token savings) are measured directly in new experiments on standard QA tasks rather than obtained by fitting parameters to the same data or by reducing to prior self-citations. No equation or derivation step equates a reported outcome to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claims rest on treating LLM search as approximate Thompson sampling inside a custom POMDP and on the claim that predicates can decompose that search without loss; no numerical free parameters are fitted in the reported results.

axioms (1)
  • domain assumption LLM agent behavior in context-gathering tasks can be modeled as approximate Thompson sampling within a POMDP whose belief state tracks missing information.
    Explicitly stated as the modeling choice that enables the subsequent predicate decomposition and interventions.
invented entities (3)
  • Context Gathering Decision Process (CGDP) no independent evidence
    purpose: Specialized POMDP that captures iterative refinement of an agent's belief about relevant information in oversized environments.
    Newly defined in the paper as the core formal object.
  • predicate-based belief state no independent evidence
    purpose: Persistent, bounded representation that replaces the LLM's implicit working memory while preserving multi-hop reasoning.
    Derived directly from the CGDP model and introduced as a plug-and-play component.
  • programmatic exhaustion gate no independent evidence
    purpose: Modular check that halts search when further exploration is unproductive without causing premature stopping.
    Second intervention derived from the CGDP framing.
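The ledger's single axiom, LLM search as approximate Thompson sampling, can be illustrated with Beta posteriors over each action's usefulness: sample one estimate per action, act on the argmax. The priors, counts, and action names below are assumptions of this sketch, not values from the paper.

```python
import random

def thompson_pick(successes, failures, rng):
    """Sample a usefulness estimate per action from its Beta posterior
    and return the action with the highest sampled value."""
    samples = {a: rng.betavariate(successes[a] + 1, failures[a] + 1)
               for a in successes}
    return max(samples, key=samples.get)

rng = random.Random(0)  # seeded for reproducibility
# one action has mostly succeeded, the other mostly failed
successes = {"search_wiki": 9, "search_code": 1}
failures = {"search_wiki": 1, "search_code": 9}
picks = [thompson_pick(successes, failures, rng) for _ in range(100)]
```

Over repeated draws the posterior favoring `search_wiki` dominates, while the occasional `search_code` pick preserves exploration, which is the behavior the axiom attributes to the LLM's implicit search.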

pith-pipeline@v0.9.0 · 5591 in / 1698 out tokens · 53946 ms · 2026-05-11T01:07:55.660635+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1]

    FAIR-RAG: Faithful adaptive iterative refinement for retrieval-augmented generation.arXiv preprint arXiv:2510.22344, 2025

    Mohammad Aghajani Asl, Majid Asgari-Bidhendi, and Behrooz Minaei-Bidgoli. FAIR-RAG: Faithful adaptive iterative refinement for retrieval-augmented generation.arXiv preprint arXiv:2510.22344, 2025

  2. [2]

    Self-RAG: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. InICLR, 2024

  3. [3]

    Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

  4. [4]

    A close look into the calibration of pre-trained language models

    Yangyi Chen, Lifan Yuan, Ganqu Cui, Zhiyuan Liu, and Heng Ji. A close look into the calibration of pre-trained language models. InACL, pages 1343–1367, 2023

  5. [5]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2025

  6. [6]

    Don’t hallucinate, abstain: Identifying LLM knowledge gaps via multi-LLM collaboration.ACL, 2024

    Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don’t hallucinate, abstain: Identifying LLM knowledge gaps via multi-LLM collaboration. ACL, 2024

  7. [7]

    Optimal best arm identification with fixed confidence

    Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. InCOLT, pages 998–1027, 2016

  8. [8]

    HippoRAG: Neurobiologically inspired long-term memory for large language models

    Bernal Jimenez Gutierrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. InNeurIPS, 2024

  9. [9]

    Large language models cannot self-correct reasoning yet

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. InICLR, 2024

  10. [10]

    Active retrieval augmented generation

    Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. InEMNLP, pages 7969–7992, 2023

  11. [11]

    Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan O. Arik. Long-context LLMs meet RAG: Overcoming challenges for long inputs in RAG. InICLR, 2025

  12. [12]

    Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. InCOLM, 2025

  13. [13]

    Planning and acting in partially observable stochastic domains

    Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998

  14. [14]

    LLMs get lost in multi- turn conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs get lost in multi- turn conversation. InICLR, 2026

  15. [15]

    Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

    Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic aggregation for parallel scaling of long-horizon agentic tasks.arXiv preprint arXiv:2604.11753, 2026

  16. [16]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InNeurIPS, 2020

  17. [17]

    LooGLE: Can long-context language models understand long contexts? InACL, pages 16304–16333, 2024

    Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. LooGLE: Can long-context language models understand long contexts? InACL, pages 16304–16333, 2024

  18. [18]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. InEMNLP, pages 5420–5438, 2025

  19. [19]

    WebThinker: Empowering large reasoning models with deep research capability

    Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. WebThinker: Empowering large reasoning models with deep research capability. InNeurIPS, 2026

  20. [20]

    HopRAG: Multi-hop reasoning for logic-aware retrieval-augmented generation

    Hao Liu, Zhengren Wang, Xi Chen, Zhiyu Li, Feiyu Xiong, Qinhan Yu, and Wentao Zhang. HopRAG: Multi-hop reasoning for logic-aware retrieval-augmented generation. In ACL Findings, pages 1897–1913, 2025

  21. [21]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  22. [22]

    Evaluating very long-term conversational memory of LLM agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. InACL, pages 13851–13870, 2024

  23. [23]

    Solving a million-step LLM task with zero errors.arXiv preprint arXiv:2511.09030, 2025

    Elliot Meyerson, Giuseppe Paolo, Roberto Dailey, Hormoz Shahrzad, Olivier Francon, Conor F Hayes, Xin Qiu, Babak Hodjat, and Risto Miikkulainen. Solving a million-step LLM task with zero errors.arXiv preprint arXiv:2511.09030, 2025

  24. [24]

    Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation

    Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. ICLR, 2024

  25. [25]

    (more) efficient reinforcement learning via posterior sampling

    Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. InNeurIPS, 2013

  26. [26]

    MemGPT: Towards LLMs as operating systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. MemGPT: Towards LLMs as operating systems. InICLR, 2024

  27. [27]

    Generative agents: Interactive simulacra of human behavior.UIST, 2023

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior.UIST, 2023

  28. [28]

    SWE-QA: Can Language Models Answer Repository-level Code Questions?

    Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. SWE-QA: Can language models answer repository-level code questions?arXiv preprint arXiv:2509.14635, 2025

  29. [29]

    StateAct: Enhancing LLM base agents via self-prompting and state-tracking

    Nikolai Rozanov and Marek Rei. StateAct: Enhancing LLM base agents via self-prompting and state-tracking. InREALM, pages 367–385, 2025

  30. [30]

    Simple bayesian algorithms for best arm identification

    Daniel Russo. Simple Bayesian algorithms for best arm identification. In COLT, pages 1417–1418, 2016

  31. [31]

    A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 2018

    Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 2018

  32. [32]

    Lost in transmission: When and why LLMs fail to reason globally

    Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, and Jennifer Neville. Lost in transmission: When and why LLMs fail to reason globally. In NeurIPS, 2025

  33. [33]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. InEMNLP Findings, pages 9248–9274, 2023

  34. [34]

    DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models

    Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models. InACL, pages 12991–13013, 2024

  35. [35]

    On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25:285–294, 1933

    William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25:285–294, 1933

  36. [36]

    MuSiQue: Multihop questions via single hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  37. [37]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. InACL, pages 10014–10037, 2023

  38. [38]

    Extending context window of large language models from a distributional perspective

    Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, and Bing Qin. Extending context window of large language models from a distributional perspective. InEMNLP, pages 7288–7301, 2024

  39. [39]

    A-Mem: Agentic memory for LLM agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-Mem: Agentic memory for LLM agents. InNeurIPS, 2025

  40. [40]

    Corrective Retrieval Augmented Generation

    Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation.arXiv preprint arXiv:2401.15884, 2024

  41. [41]

    SWE-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InNeurIPS, pages 50528–50652, 2024

  42. [42]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InICLR, 2023

  43. [43]

    Lost in the maze: Overcoming context limitations in long-horizon agentic search.arXiv preprint arXiv:2510.18939, 2025

    Howard Yen, Ashwin Paranjape, Mengzhou Xia, Thejas Venkatesh, Jack Hessel, Danqi Chen, and Yuhao Zhang. Lost in the maze: Overcoming context limitations in long-horizon agentic search.arXiv preprint arXiv:2510.18939, 2025

  44. [44]

    DeepResearcher: Scaling deep research via reinforcement learning in real-world environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In EMNLP, pages 414–431, 2025


    Aggressive smooth( U=0.10–0.15, p=2): high fire rates, more negatives on non-IRCoT methods but larger gains on IRCoT. Three pairs of configurations are functionally identical (L1 = 0pp): thefull vs.query_and_full trigger mode makes no difference when the diff threshold is not binding. The global config is near-optimal: L1 distance between the global and p...