SeekerGym is a new benchmark that measures how completely AI agents retrieve information from full documents and how well they quantify uncertainty about missing parts, with top methods achieving only 42.5% recall on Wikipedia and 29.2% on ML surveys.
deduplicated trajectory, restricted to reasoning models (GPT-oss-120b, GPT-oss-20b, Qwen3-235B-A22B, Nemotron-3-Nano-30B)
1 Pith paper cites this work (field: cs.LG; year: 2026; verdict: CONDITIONAL).
SeekerGym: A Benchmark for Reliable Information Seeking