SeekerGym: A Benchmark for Reliable Information Seeking

Minseung Lee; Osbert Bastani; Remy Kim; Shuo Li

arxiv: 2604.17143 · v1 · submitted 2026-04-18 · 💻 cs.LG

Pith reviewed 2026-05-10 06:22 UTC · model grok-4.3

keywords: benchmark, information retrieval, AI agents, completeness, uncertainty, Wikipedia, survey papers

The pith

SeekerGym measures how much of a full document's passages an AI agent manages to retrieve.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SeekerGym as a benchmark that tests whether AI agents can recover all relevant sections from documents assumed to cover a topic completely. Agents issue queries to pull passages from Wikipedia articles or machine learning survey papers, and performance is scored by the fraction of sections retrieved. Current methods achieve only 42.5 percent recovery on Wikipedia tasks and 29.2 percent on surveys. The benchmark also checks whether agents can report how much information they may have missed. Incomplete retrieval leaves gaps that can bias downstream decisions even when the returned facts are accurate.

Core claim

SeekerGym defines tasks in which an agent must retrieve passages from a complete document by issuing queries, using the document's own sections as the measure of full coverage. On Wikipedia the strongest methods recover 42.5 percent of passages while on ML survey papers they recover 29.2 percent, and agents are scored separately on how well they estimate the completeness of their own retrievals.
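As a minimal sketch of the scoring idea described above, assuming passages are identified by unique ids (the function name and the sample ids are hypothetical, not taken from the paper), the completeness ratio could be computed like this:

```python
def completeness_ratio(retrieved_ids, goal_ids):
    """Fraction of goal passages that appear among the retrieved passages.

    Assumption: each passage has a unique id. Duplicate retrievals count
    only once because both collections are treated as sets.
    """
    goal = set(goal_ids)
    if not goal:
        raise ValueError("goal_ids must be non-empty")
    return len(goal & set(retrieved_ids)) / len(goal)


# Toy example with made-up section ids: two of four goal passages were found.
goal_passages = ["intro", "history", "legacy", "reception"]
retrieved_passages = ["intro", "history", "history"]
ratio = completeness_ratio(retrieved_passages, goal_passages)
print(ratio)  # 0.5
```

Treating both sides as sets means repeated hits on the same passage do not inflate the score, which matches the "fraction of goal passages retrieved" reading of the metric.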

What carries the argument

The SeekerGym benchmark itself: each task is built on a full document, the agent must issue queries to retrieve that document's passages, and coverage is measured directly against the document's sections.

Load-bearing premise

A single document such as a Wikipedia article or survey paper contains every relevant piece of information on its topic.

What would settle it

A controlled test in which an agent retrieves every section of the source document while the document is independently shown to omit key facts on the topic would show the benchmark does not track true completeness.

Figures

Figures reproduced from arXiv: 2604.17143 by Minseung Lee, Osbert Bastani, Remy Kim, Shuo Li.

**Figure 1.** Information seeking under partial observability and its SeekerGym instantiation. (a) Under partial observability, a researcher iteratively asks questions or forms hypotheses, gathers evidence, updates a working belief shaped by curiosity and remaining uncertainty, and decides whether their understanding is sufficient to stop or whether further exploration is needed. (b) SeekerGym instantiates this generic…

**Figure 2.** SeekerGym overview. (a) Retrieval Process. Wikipedia articles are parsed into passages and indexed via embed(·). Given a query a_t, the Retriever performs vector search over the passage index and returns all goal passages whose embedding similarity to a_t exceeds threshold θ as the observation o_t. (b) Belief Representations. Three representations of the belief state b_t: raw trajectory (b_trajectory) preser…

**Figure 3.** Mean final completeness ratio (fraction of goal passages retrieved) per Wikipedia…

**Figure 4.** (a) Completeness ratio by belief representation types on Wikipedia, averaged…

**Figure 5.** Threshold sensitivity analysis. (a) Aggregate completeness ratio across thresholds.

**Figure 6.** Mean reasoning token count per query step on Wikipedia under oracle belief vs.…

**Figure 7.** Mean reasoning token count per query step on ML Surveys under oracle belief vs.…

**Figure 8.** ML Surveys counterpart to Figure…

**Figure 9.** Mean final completeness ratio by topic area and model on ML Surveys, using the…

**Figure 10.** Final completeness ratio by model and belief representation on Wikipedia. Models…

**Figure 11.** Final completeness ratio by model and belief representation on ML Surveys.

**Figure 12.** Completeness ratio over exploration steps on Wikipedia for each model, broken…

**Figure 13.** Completeness ratio over exploration steps on ML Surveys for each model, broken…

**Figure 14.** Why raw trajectory underperforms on Wikipedia. (a)…

**Figure 15.** Why raw trajectory underperforms on ML Surveys. (a)…

**Figure 16.** Completeness estimation on Wikipedia: true completeness ratio (x-axis) vs.…

**Figure 17.** Completeness estimation on ML Surveys: true completeness ratio (x-axis) vs.…

**Figure 18.** Wikipedia nonconformity-score distributions used for conformal calibration,…

**Figure 19.** ML Surveys nonconformity-score distributions used for conformal calibration.
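The retrieval step described in the Figure 2 caption (vector search that returns every passage whose embedding similarity to the query exceeds a threshold θ) can be illustrated with a small sketch. This is not the paper's implementation: cosine similarity is an assumed similarity measure, the hand-written three-dimensional vectors stand in for the output of embed(·), and the names `cosine`, `retrieve`, and `passage_index` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, passage_index, theta):
    """Return ids of all passages whose similarity to the query exceeds theta."""
    return [pid for pid, vec in passage_index.items()
            if cosine(query_vec, vec) > theta]


# Toy index: made-up vectors standing in for real passage embeddings.
passage_index = {
    "early_life": [1.0, 0.0, 0.0],
    "career": [0.0, 1.0, 0.0],
    "legacy": [0.0, 0.6, 0.8],
}
query = [0.0, 1.0, 0.0]

loose = retrieve(query, passage_index, theta=0.5)   # lower threshold, more passages
strict = retrieve(query, passage_index, theta=0.7)  # higher threshold, fewer passages
print(loose, strict)
```

Because every passage above the threshold is returned rather than a fixed top-k, the choice of θ directly controls how many passages a single query can uncover, which is presumably why the paper reports a threshold sensitivity analysis (Figure 5).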

Original abstract

Despite their substantial successes, AI agents continue to face fundamental challenges in terms of trustworthiness. Consider deep research agents, tasked with searching for information relevant to a given topic: while AI agents can perform effective information retrieval, there is little guarantee regarding the completeness of this information. Gaps in retrieved information can leave biases that mislead users even if the information they are given is correct and relevant. We introduce SeekerGym, a benchmark designed to evaluate the completeness of information retrieved by AI agents. In addition, SeekerGym also measures how well agents quantify their uncertainty in the completeness of their information; if an agent fails to retrieve all relevant information, it is useful for it to at least quantify how much might be missing. At a high level, each task in SeekerGym is a document (e.g., a Wikipedia article), and the AI agent must issue queries to retrieve passages from that document. Intuitively, the document comprehensively covers a topic, so the ability to retrieve its sections directly measures completeness of information retrieval. In addition to Wikipedia, we also consider machine learning survey papers, where the goal is to retrieve relevant sections of a survey paper. We benchmark several models and algorithms; the best approaches retrieve 42.5% of passages on Wikipedia and 29.2% on ML Surveys, leaving substantial room for improvement.
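The abstract's second goal, quantifying how much might be missing, pairs naturally with the conformal calibration mentioned in the captions of Figures 18 and 19. The sketch below shows generic split conformal prediction applied to an agent's self-reported completeness estimate. It is an illustration under stated assumptions, not the paper's procedure: the nonconformity score |estimate − true ratio| is an assumed choice, the calibration pairs are made-up toy values, and the function names are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the requested level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(scores)[rank - 1]


def completeness_interval(estimate, q):
    """Interval around the agent's estimate, clipped to the valid range [0, 1]."""
    return max(0.0, estimate - q), min(1.0, estimate + q)


# Made-up calibration pairs: (agent's estimated completeness, true completeness).
calibration = [(0.50, 0.40), (0.30, 0.35), (0.80, 0.60), (0.20, 0.25), (0.60, 0.45),
               (0.40, 0.30), (0.70, 0.55), (0.45, 0.40), (0.35, 0.30)]

# Assumed nonconformity score: absolute error of the estimate.
scores = [abs(est - true) for est, true in calibration]
q = conformal_quantile(scores, alpha=0.2)

# Interval for a new task on which the agent reports 50% completeness.
low, high = completeness_interval(0.5, q)
print(q, low, high)
```

Under the usual exchangeability assumption, an interval built this way contains the true completeness ratio with probability at least 1 − alpha, which is one way to turn a raw self-estimate into a statement about how much could still be missing.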

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SeekerGym is a new benchmark that scores AI agents on retrieving all passages from fixed documents like Wikipedia pages and ML surveys, with top methods at 42.5% and 29.2%, plus uncertainty checks, but the completeness claim rests on an unverified assumption that those documents are exhaustive.

The punchline is that this paper gives researchers a concrete benchmark for testing whether information-seeking agents pull complete coverage from a source document and whether they can flag their own gaps. The reported numbers show current approaches fall well short, which lines up with the motivation around trustworthiness and bias from missing information. That part is useful and straightforward to understand from the abstract and setup.

What is new is the specific framing that turns document retrieval into a completeness task with passage-level scoring and an added uncertainty quantification component. Prior retrieval work focuses more on relevance or ranking, so this direct completeness angle plus the uncertainty angle is a distinct evaluation protocol. The paper does well at laying out the task clearly and running baselines to produce those concrete percentages, which highlight a practical gap without overclaiming.

The main soft spot is the core premise that retrieving every section from the document equals completeness on the topic. The abstract calls this intuitive, but Wikipedia articles and survey papers often omit recent developments, counter-views, or narrower subtopics, and the paper does not describe any audit or validation of coverage. If the sources themselves are incomplete, then even perfect retrieval scores would not prove the agent has full information, and the reported shortfalls could partly reflect document limits rather than agent limits. That assumption is load-bearing for the benchmark's interpretation.

Other details like exact query generation, the uncertainty metric formula, baseline code, and any statistical significance checks are not visible in the abstract, so the full paper needs to supply those to make the numbers fully reproducible.

This work is aimed at people building and evaluating AI agents for research or deep search tasks. A reader focused on agent reliability or retrieval evaluation would find it worth trying as a test suite. It deserves peer review because the problem is real, the benchmark is simple to implement, and the gap it shows is worth exploring further, even if the coverage assumption needs tightening and more implementation transparency.

Referee Report

2 major / 1 minor

Summary. The paper introduces SeekerGym, a benchmark for evaluating the completeness of the information that AI agents retrieve on a given topic. Each task is built on a fixed document (Wikipedia article or ML survey paper); agents issue queries to retrieve its passages, with performance measured as the fraction of passages successfully retrieved. The benchmark also assesses agents' uncertainty quantification regarding completeness. Concrete results show the best approaches retrieve 42.5% of passages on Wikipedia and 29.2% on ML surveys.

Significance. SeekerGym supplies a concrete, document-grounded benchmark for a key limitation of current agents (incomplete retrieval that can introduce bias), with reported retrieval rates that quantify the gap. The direct use of ground-truth passages from source documents and the addition of an uncertainty metric are strengths that make the benchmark falsifiable and extensible.

major comments (2)

[Abstract] Abstract, final paragraph: the evaluation treats retrieval of all sections from a fixed document as a direct measure of 'completeness of information retrieval' on the topic, resting on the unvalidated premise that 'the document comprehensively covers a topic.' No coverage audit, comparison to external sources, or discussion of omitted subtopics/recent developments is described; this assumption is load-bearing for interpreting the 42.5%/29.2% figures as evidence of agent shortcomings rather than document incompleteness.
[Abstract] Abstract and evaluation description: the reported percentages (42.5% Wikipedia, 29.2% ML Surveys) and uncertainty results lack accompanying details on query formulation procedure, exact passage-matching metric, baseline algorithm implementations, statistical significance tests, or controls for confounds such as query length or document length. These omissions prevent verification that the numbers reliably support the central claim of 'substantial room for improvement.'

minor comments (1)

[Abstract] Abstract: the phrase 'issue queries to retrieve passages from that document' would benefit from an explicit definition of what constitutes a successful retrieval (e.g., exact match, semantic similarity threshold).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline planned revisions to strengthen the manuscript.

Referee: [Abstract] Abstract, final paragraph: the evaluation treats retrieval of all sections from a fixed document as a direct measure of 'completeness of information retrieval' on the topic, resting on the unvalidated premise that 'the document comprehensively covers a topic.' No coverage audit, comparison to external sources, or discussion of omitted subtopics/recent developments is described; this assumption is load-bearing for interpreting the 42.5%/29.2% figures as evidence of agent shortcomings rather than document incompleteness.

Authors: The benchmark is intentionally scoped to measure an agent's ability to retrieve all passages from a single provided document, using that document as the ground-truth reference for the task. The abstract describes this as an 'intuitive' premise rather than a validated claim of exhaustive topic coverage. We agree that Wikipedia articles and survey papers are not guaranteed to be complete and may omit subtopics or recent developments. In revision, we will add an explicit limitations paragraph clarifying that the reported retrieval rates (42.5% and 29.2%) reflect performance relative to the given document's content, not absolute completeness of the underlying topic. This will prevent misinterpretation and make the benchmark's scope clearer.

Revision: yes

Referee: [Abstract] Abstract and evaluation description: the reported percentages (42.5% Wikipedia, 29.2% ML Surveys) and uncertainty results lack accompanying details on query formulation procedure, exact passage-matching metric, baseline algorithm implementations, statistical significance tests, or controls for confounds such as query length or document length. These omissions prevent verification that the numbers reliably support the central claim of 'substantial room for improvement.'

Authors: The full manuscript contains a dedicated methods and evaluation section that specifies query formulation (agent-generated queries to retrieve passages), the passage-matching criterion (successful retrieval of ground-truth sections), baseline implementations, and document selection. However, the abstract is intentionally brief. To address the concern, we will expand the abstract with concise descriptions of these elements and add controls for document length and query characteristics to the evaluation description. We will also incorporate statistical significance testing or confidence intervals in the results tables to better support the claim of substantial room for improvement.

Revision: partial
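To make the promised confidence intervals concrete, here is a minimal sketch of a percentile bootstrap over per-task completeness ratios. It is one possible approach, not something the paper or the rebuttal specifies: the function name `bootstrap_ci`, the resample count, and the per-task values are all hypothetical, and the toy numbers are not results from the benchmark.

```python
import random


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up per-task completeness ratios, for illustration only.
per_task_ratios = [0.25, 0.50, 0.40, 0.30, 0.60, 0.45, 0.35, 0.55, 0.20, 0.50]
mean_ratio = sum(per_task_ratios) / len(per_task_ratios)
ci_low, ci_high = bootstrap_ci(per_task_ratios)
print(mean_ratio, ci_low, ci_high)
```

Resampling at the task level treats each document as the unit of variation, which seems the natural choice when the headline number is a mean over documents.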

Circularity Check

0 steps flagged

No circularity: benchmark uses direct ground-truth retrieval against fixed documents

The paper defines SeekerGym as a benchmark in which agents issue queries to retrieve passages from a fixed source document (Wikipedia article or ML survey), with completeness scored as the fraction of ground-truth passages recovered. This is a direct empirical measurement against externally provided document content, with no equations, fitted parameters, predictions derived from the same data, or self-citations invoked to justify the metric. The stated intuition that 'the document comprehensively covers a topic' is an explicit modeling assumption about the world rather than a self-referential definition or reduction of the result to its inputs. No derivation chain exists that collapses by construction; the evaluation remains self-contained against the chosen ground-truth documents.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central evaluation rests on one domain assumption about document completeness and contains no free parameters or invented entities.

axioms (1)

Domain assumption: The source documents (Wikipedia articles and ML survey papers) comprehensively cover their topics without significant omissions. Invoked to justify that retrieving all sections from the document measures general completeness of information retrieval.

