SeekerGym: A Benchmark for Reliable Information Seeking
Pith reviewed 2026-05-10 06:22 UTC · model grok-4.3
The pith
SeekerGym measures how completely AI agents retrieve the passages of a full document.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SeekerGym defines tasks in which an agent must retrieve passages from a complete document by issuing queries, using the document's own sections as the measure of full coverage. On Wikipedia the strongest methods recover 42.5 percent of passages while on ML survey papers they recover 29.2 percent, and agents are scored separately on how well they estimate the completeness of their own retrievals.
What carries the argument
SeekerGym benchmark in which each task supplies a full document and the agent must issue queries to retrieve its passages, with coverage measured directly against the document's sections.
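The coverage measure described above can be sketched as a set computation. This is a minimal illustration, not the paper's implementation: the function name `coverage`, the passage identifiers, and the assumption that each retrieved item maps cleanly to a passage ID are all hypothetical.

```python
def coverage(ground_truth_ids, retrieved_ids):
    """Fraction of the document's passages that the agent retrieved.

    Duplicates and items that are not in the document do not add credit.
    """
    truth = set(ground_truth_ids)
    if not truth:
        raise ValueError("document has no ground-truth passages")
    found = truth & set(retrieved_ids)
    return len(found) / len(truth)


# Hypothetical document with 8 passages (IDs are made up for illustration).
doc_passages = ["p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7"]
# The agent returned one duplicate ("p1") and one item outside the document ("x9").
retrieved = ["p1", "p1", "p4", "p6", "x9"]

score = coverage(doc_passages, retrieved)
print(f"coverage = {score:.3f}")
```

Using sets means repeated retrievals of the same passage earn no extra credit, which seems to match the "fraction of passages recovered" reading, though the source does not say how duplicates are handled.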
Load-bearing premise
A single document such as a Wikipedia article or survey paper contains every relevant piece of information on its topic.
What would settle it
A controlled test in which an agent retrieves every section of the source document while the document is independently shown to omit key facts on the topic would show the benchmark does not track true completeness.
Original abstract
Despite their substantial successes, AI agents continue to face fundamental challenges in terms of trustworthiness. Consider deep research agents, tasked with searching for information relevant to a given topic: while AI agents can perform effective information retrieval, there is little guarantee regarding the completeness of this information. Gaps in retrieved information can leave biases that mislead users even if the information they are given is correct and relevant. We introduce SeekerGym, a benchmark designed to evaluate the completeness of information retrieved by AI agents. In addition, SeekerGym also measures how well agents quantify their uncertainty in the completeness of their information; if an agent fails to retrieve all relevant information, it is useful for it to at least quantify how much might be missing. At a high level, each task in SeekerGym is a document (e.g., a Wikipedia article), and the AI agent must issue queries to retrieve passages from that document. Intuitively, the document comprehensively covers a topic, so the ability to retrieve its sections directly measures completeness of information retrieval. In addition to Wikipedia, we also consider machine learning survey papers, where the goal is to retrieve relevant sections of a survey paper. We benchmark several models and algorithms; the best approaches retrieve 42.5% of passages on Wikipedia and 29.2% on ML Surveys, leaving substantial room for improvement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SeekerGym, a benchmark for evaluating the completeness of information retrieved by AI agents on topics. Each task provides a fixed document (Wikipedia article or ML survey paper); agents issue queries to retrieve its passages, with performance measured as the fraction of passages successfully retrieved. The benchmark also assesses agents' uncertainty quantification regarding completeness. Concrete results show the best approaches retrieve 42.5% of passages on Wikipedia and 29.2% on ML surveys.
Significance. SeekerGym supplies a concrete, document-grounded benchmark for a key limitation of current agents (incomplete retrieval that can introduce bias), with reported retrieval rates that quantify the gap. The direct use of ground-truth passages from source documents and the addition of an uncertainty metric are strengths that make the benchmark falsifiable and extensible.
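One way to read the uncertainty component is as a gap between an agent's self-reported completeness and its actual coverage. The sketch below assumes a mean-absolute-error score; the abstract does not name the paper's uncertainty metric, so `completeness_error` and the numbers are illustrative stand-ins only.

```python
def completeness_error(true_coverage, estimated_coverage):
    """Mean absolute gap between actual and self-reported coverage over tasks."""
    if not true_coverage or len(true_coverage) != len(estimated_coverage):
        raise ValueError("need two non-empty lists of equal length")
    gaps = [abs(t - e) for t, e in zip(true_coverage, estimated_coverage)]
    return sum(gaps) / len(gaps)


# Hypothetical values for three tasks (not taken from the paper).
true_cov = [0.25, 0.5, 0.75]   # fraction of passages actually retrieved
est_cov = [0.5, 0.5, 1.0]      # fraction the agent believes it retrieved

err = completeness_error(true_cov, est_cov)
print(f"mean absolute completeness error = {err:.3f}")
```

A lower value means the agent's sense of what it is missing tracks reality more closely, independent of how much it actually retrieved.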
major comments (2)
- [Abstract] Abstract, final paragraph: the evaluation treats retrieval of all sections from a fixed document as a direct measure of 'completeness of information retrieval' on the topic, resting on the unvalidated premise that 'the document comprehensively covers a topic.' No coverage audit, comparison to external sources, or discussion of omitted subtopics/recent developments is described; this assumption is load-bearing for interpreting the 42.5%/29.2% figures as evidence of agent shortcomings rather than document incompleteness.
- [Abstract] Abstract and evaluation description: the reported percentages (42.5% Wikipedia, 29.2% ML Surveys) and uncertainty results lack accompanying details on query formulation procedure, exact passage-matching metric, baseline algorithm implementations, statistical significance tests, or controls for confounds such as query length or document length. These omissions prevent verification that the numbers reliably support the central claim of 'substantial room for improvement.'
minor comments (1)
- [Abstract] Abstract: the phrase 'issue queries to retrieve passages from that document' would benefit from an explicit definition of what constitutes a successful retrieval (e.g., exact match, semantic similarity threshold).
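The two kinds of criterion named in the minor comment can be contrasted in a short sketch. It assumes token-level Jaccard overlap as a cheap stand-in for semantic similarity; the threshold of 0.8 and the helper names are hypothetical and are not the paper's matching rule.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a passage."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_retrieved(passage, candidate, threshold=0.8, exact=False):
    """Decide whether `candidate` counts as a retrieval of `passage`."""
    if exact:
        return passage.strip() == candidate.strip()
    return jaccard(passage, candidate) >= threshold


passage = "The storm made landfall in Florida on Monday."
candidate = "the storm made landfall in florida on monday"

print(is_retrieved(passage, candidate))              # overlap-based criterion
print(is_retrieved(passage, candidate, exact=True))  # exact-match criterion
```

The same candidate passes under the overlap criterion and fails under exact match, which is why the reported percentages are hard to interpret without knowing which rule was used.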
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and outline planned revisions to strengthen the manuscript.
- Referee: [Abstract] Abstract, final paragraph: the evaluation treats retrieval of all sections from a fixed document as a direct measure of 'completeness of information retrieval' on the topic, resting on the unvalidated premise that 'the document comprehensively covers a topic.' No coverage audit, comparison to external sources, or discussion of omitted subtopics/recent developments is described; this assumption is load-bearing for interpreting the 42.5%/29.2% figures as evidence of agent shortcomings rather than document incompleteness.
Authors: The benchmark is intentionally scoped to measure an agent's ability to retrieve all passages from a single provided document, using that document as the ground-truth reference for the task. The abstract describes this as an 'intuitive' premise rather than a validated claim of exhaustive topic coverage. We agree that Wikipedia articles and survey papers are not guaranteed to be complete and may omit subtopics or recent developments. In revision, we will add an explicit limitations paragraph clarifying that the reported retrieval rates (42.5% and 29.2%) reflect performance relative to the given document's content, not absolute completeness of the underlying topic. This will prevent misinterpretation and make the benchmark's scope clearer.
revision: yes
- Referee: [Abstract] Abstract and evaluation description: the reported percentages (42.5% Wikipedia, 29.2% ML Surveys) and uncertainty results lack accompanying details on query formulation procedure, exact passage-matching metric, baseline algorithm implementations, statistical significance tests, or controls for confounds such as query length or document length. These omissions prevent verification that the numbers reliably support the central claim of 'substantial room for improvement.'
Authors: The full manuscript contains a dedicated methods and evaluation section that specifies query formulation (agent-generated queries to retrieve passages), the passage-matching criterion (successful retrieval of ground-truth sections), baseline implementations, and document selection. However, the abstract is intentionally brief. To address the concern, we will expand the abstract with concise descriptions of these elements and add controls for document length and query characteristics to the evaluation description. We will also incorporate statistical significance testing or confidence intervals in the results tables to better support the claim of substantial room for improvement.
revision: partial
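The confidence intervals mentioned in the response could be obtained with a percentile bootstrap over per-document coverage scores. The following is a sketch under stated assumptions: the rebuttal does not say which procedure the authors would use, and the scores, the function name `bootstrap_ci`, and the resample count are hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-task scores."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and record the resampled mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-document coverage scores (not taken from the paper).
per_doc_coverage = [0.2, 0.3, 0.4, 0.5, 0.6, 0.35, 0.45, 0.25]
mean_cov = sum(per_doc_coverage) / len(per_doc_coverage)
low, high = bootstrap_ci(per_doc_coverage)
print(f"mean coverage = {mean_cov:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling at the document level treats each document as the unit of variation, which seems the natural choice when a headline number is an average over tasks.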
Circularity Check
No circularity: benchmark uses direct ground-truth retrieval against fixed documents
The paper defines SeekerGym as a benchmark in which agents issue queries to retrieve passages from a fixed source document (Wikipedia article or ML survey), with completeness scored as the fraction of ground-truth passages recovered. This is a direct empirical measurement against externally provided document content, with no equations, fitted parameters, predictions derived from the same data, or self-citations invoked to justify the metric. The stated intuition that 'the document comprehensively covers a topic' is an explicit modeling assumption about the world rather than a self-referential definition or reduction of the result to its inputs. No derivation chain exists that collapses by construction; the evaluation remains self-contained against the chosen ground-truth documents.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The source documents (Wikipedia articles and ML survey papers) comprehensively cover their topics without significant omissions.
Reference graph
Works this paper leans on
- [1] Template-dominated structure. Clusters where articles follow near-identical section layouts, testing template recall rather than information seeking (e.g., Tropical Cyclones & Hurricane Seasons, US Highways & Interstate Routes, Radio & Television Stations, Rail & Transit Systems)
- [2] Narrow biographical lists. Clusters consisting primarily of person articles within a single narrow activity, offering limited topical diversity within the cluster (e.g., Cricket Players, Ice Hockey Players, American Football Players, Professional Wrestling, Film Actresses)
- [3] Narrow geographic or institutional scope. Clusters tied to a specific nation’s niche institutions with poor generalizability (e.g., British Soap Opera Characters, Romanian Literature, NYC Buildings, Commonwealth Politicians, German & Royal Navy Warships, Military Officers). Preprint. Under review. Table 6: Wikipedia topic-cluster selection for the final...
- [4] Redundancy with a selected cluster. Clusters semantically overlapping with an already-selected cluster (e.g., Japanese JRPGs ↔ Video Games, Popular Music Releases ↔ Music Genres). After curation, we retain 15 topic clusters spanning science & technology, history & society, and arts & entertainment. Table 6 lists all 35 candidate clusters with their inclusion...
- [5] Reasoning chain length (reasoning models only). We compare the output reasoning token count per query step under oracle belief vs. deduplicated trajectory, restricted to reasoning models (GPT-oss-120b, GPT-oss-20b, Qwen3-235B-A22B, Nemotron-3-Nano-30B). The oracle belief representation produces substantially longer reasoning chains despite adding only a...
- [6] Subgoal structure as query scope reduction. The oracle belief representation reveals not just that information is missing but where: the location of each gap relative to found passages. This structural information allows the model to scope each query to a specific target region rather than searching globally. Without it, deduplicated trajectory forces the m...