pith. machine review for the scientific record.

arxiv: 2504.12516 · v1 · submitted 2025-04-16 · 💻 cs.CL

Recognition: 3 Lean theorem links

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 07:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords web browsing agents · AI benchmarks · information retrieval · persistence · evaluation dataset · agent capabilities · short-answer verification

The pith

BrowseComp offers 1,266 short-answer questions to test agents' persistence and creativity while browsing the web for hard-to-find information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BrowseComp as a benchmark of 1,266 questions that force agents to navigate the internet persistently to locate entangled facts. Despite the difficulty, the questions use short, easily verifiable answers so the benchmark stays simple to administer and score. This design targets the core skills of persistence and creativity in information search, drawing an analogy to programming competitions as a useful but incomplete proxy for real coding tasks. The authors note that the benchmark avoids complications like generating long answers or resolving query ambiguity. By focusing on verifiable retrieval, BrowseComp aims to give a clear signal of progress toward capable web-browsing agents.

Core claim

BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information.

What carries the argument

The BrowseComp dataset of 1,266 questions, each engineered to demand repeated web navigation to assemble entangled facts into short verifiable answers.
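What "short and easily verifiable" means in practice can be made concrete with a minimal grading sketch. Everything below is an illustrative assumption rather than the repository's actual scorer: a real grader might be more forgiving (for instance, model-based judging of paraphrases), but exact match over normalized strings is the simplest instance of the design.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    trivial formatting differences do not count as wrong answers."""
    text = text.strip().lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation such as commas
    return re.sub(r"\s+", " ", text)      # collapse runs of whitespace

def grade(predicted: str, reference: str) -> bool:
    """Score a question correct iff the normalized short answer
    matches the normalized reference exactly."""
    return normalize(predicted) == normalize(reference)

def accuracy(agent, questions) -> float:
    """Benchmark accuracy: fraction of questions graded correct.
    `agent` is any callable mapping a prompt string to an answer string;
    `questions` is a hypothetical list of {"prompt", "answer"} records."""
    hits = sum(grade(agent(q["prompt"]), q["answer"]) for q in questions)
    return hits / len(questions)

print(grade("paris france", "Paris, France"))  # True: formatting-insensitive
```

Because the scorer is a pure function of two short strings, the benchmark can be administered and re-scored automatically, which is the "simple and easy-to-use" half of the paper's claim.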

If this is right

  • Agents that perform well on BrowseComp demonstrate stronger ability to sustain search effort across multiple steps.
  • The benchmark supplies a standardized, automatically scorable test that can track iterative improvements in browsing agents.
  • Success on BrowseComp indicates progress on locating information that is distributed across pages rather than available in a single search.
  • The dataset can serve as a training signal for agents by rewarding sequences of navigation actions that reach the reference answer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark proves predictive, researchers could use it to prioritize agent architectures that maintain long search chains over those optimized only for single-step retrieval.
  • Extending the questions with time-to-answer metrics would let developers measure not just accuracy but also the efficiency of persistence; a sketch of such instrumentation follows this list.
  • The approach could generalize to other domains, such as scientific literature search, where facts are similarly scattered.
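On the second of these extensions, here is a minimal sketch of what time-and-steps instrumentation could look like. The step-wise interface `agent_step(question) -> (answer, done)` is a hypothetical assumption, not anything the paper defines; real agent loops differ.

```python
import time
from dataclasses import dataclass

@dataclass
class EpisodeTrace:
    """Per-question record of effort, not just correctness."""
    correct: bool = False
    steps: int = 0              # navigation/tool actions taken
    wall_seconds: float = 0.0   # time from question to final answer

def grade(predicted: str, reference: str) -> bool:
    """Minimal stand-in for the grader sketched earlier."""
    return predicted.strip().lower() == reference.strip().lower()

def run_with_trace(agent_step, question: str, reference: str,
                   max_steps: int = 100) -> EpisodeTrace:
    """Drive a hypothetical step-wise agent, recording how much
    searching it took to reach (or fail to reach) the answer."""
    trace = EpisodeTrace()
    start = time.monotonic()
    for _ in range(max_steps):
        answer, done = agent_step(question)
        trace.steps += 1
        if done:
            trace.correct = grade(answer, reference)
            break
    trace.wall_seconds = time.monotonic() - start
    return trace
```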

Load-bearing premise

Short, easily verifiable answers are enough to measure the persistence and creativity that matter for real browsing.

What would settle it

An experiment showing that agents scoring high on BrowseComp still fail to locate comparable information when the questions are rephrased into open-ended or ambiguous real-world tasks.

read the original abstract

We present BrowseComp, a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents BrowseComp, a benchmark of 1,266 questions intended to evaluate web-browsing agents on their ability to persistently navigate the internet in search of hard-to-find, entangled information. Answers are short and easily verifiable against references, and the benchmark is explicitly framed as an incomplete but useful proxy (analogous to programming competitions) that isolates the core capabilities of persistence and creativity while deliberately avoiding long-form output and ambiguity.

Significance. If the questions are shown to be well-constructed and to genuinely require the claimed navigation behaviors, BrowseComp could become a practical, reproducible standard for measuring a key agent capability. The public GitHub release supports immediate use and extension by the community.

major comments (2)
  1. [Abstract] The central claim that the 1,266 questions 'require persistently navigating the internet in search of hard-to-find, entangled information' and thereby measure persistence and creativity is unsupported by any description of question sourcing, validation, difficulty calibration, or inter-rater agreement.
  2. [Abstract] The design decision to use only short, verifiable answers is presented as sufficient to isolate the target capability, yet no evidence or analysis is supplied showing that this format actually elicits (rather than bypasses) the persistence and creativity the benchmark claims to measure.
minor comments (1)
  1. [Abstract] The GitHub link is given, but the paper would benefit from one or two concrete question examples to illustrate the intended difficulty and verification process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We agree that the abstract would benefit from additional supporting details and have revised the paper accordingly to strengthen the presentation of the benchmark.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the 1,266 questions 'require persistently navigating the internet in search of hard-to-find, entangled information' and thereby measure persistence and creativity is unsupported by any description of question sourcing, validation, difficulty calibration, or inter-rater agreement.

    Authors: We agree that the abstract lacks explicit details on these aspects. The full manuscript contains a Benchmark Construction section that describes sourcing questions from publicly available web content requiring multi-page navigation to resolve entangled facts, followed by manual verification of each answer against the source material and iterative difficulty calibration via pilot runs with baseline agents. We will revise the abstract to briefly reference this process and expand the main text with additional specifics on validation. Inter-rater agreement is not applicable in the conventional sense because each question has a single, objectively verifiable short answer; we will add a clarifying note on this point. revision: yes

  2. Referee: [Abstract] The design decision to use only short, verifiable answers is presented as sufficient to isolate the target capability, yet no evidence or analysis is supplied showing that this format actually elicits (rather than bypasses) the persistence and creativity the benchmark claims to measure.

    Authors: The short-answer format is chosen precisely to isolate persistence and information-seeking from the separate challenges of long-form generation and ambiguity, consistent with the programming-competition analogy stated in the paper. The manuscript already includes example questions (in the main text and appendix) that illustrate the need for multi-step browsing. To provide stronger evidence, we will add a new analysis subsection reporting quantitative metrics on agent behavior, such as the distribution of page visits and tool calls for solved versus unsolved questions, demonstrating that high performance correlates with persistent navigation rather than shortcuts. revision: yes
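The behavioral analysis promised here could be as simple as the following sketch. The `logs` schema is a hypothetical placeholder (the rebuttal names page visits and tool calls but specifies no format); the point is only that solved-versus-unsolved effort distributions are straightforward to compute from run logs.

```python
from statistics import mean, median

# Hypothetical per-question run logs; the schema is an illustrative
# assumption, not a format the paper defines.
logs = [
    {"solved": True,  "page_visits": 14, "tool_calls": 22},
    {"solved": True,  "page_visits": 9,  "tool_calls": 15},
    {"solved": False, "page_visits": 31, "tool_calls": 48},
    {"solved": False, "page_visits": 3,  "tool_calls": 5},
]

def summarize(records, key):
    """Mean, median, and count of one effort metric."""
    vals = [r[key] for r in records]
    return {"mean": mean(vals), "median": median(vals), "n": len(vals)}

solved   = [r for r in logs if r["solved"]]
unsolved = [r for r in logs if not r["solved"]]

for key in ("page_visits", "tool_calls"):
    print(key, "solved:",   summarize(solved, key))
    print(key, "unsolved:", summarize(unsolved, key))
```

If high scores correlate with long navigation traces rather than single-shot retrieval, that would support the authors' claim that the format elicits persistence instead of bypassing it.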

Circularity Check

0 steps flagged

No circularity: benchmark definition with no derivations or self-referential reductions

full rationale

The paper introduces BrowseComp as a collection of 1,266 questions without any equations, fitted parameters, predictions, or derivation chain. It directly defines the benchmark, notes its analogy to programming competitions as an illustrative framing, and states its scope limitations explicitly. No self-citations, ansatzes, or uniqueness claims reduce any result to its own inputs by construction. The central claim—that short verifiable answers isolate persistence and creativity—is presented as a design choice rather than a derived theorem, making the work self-contained by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contribution rests on the creation of the question set itself rather than on fitted parameters or new theoretical entities.

axioms (1)
  • domain assumption: The selected questions require persistent navigation and creativity to solve.
    This premise is stated directly in the abstract as the defining property of the benchmark.

pith-pipeline@v0.9.0 · 5448 in / 1147 out tokens · 46853 ms · 2026-05-12T07:39:42.916354+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  2. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

    cs.AI 2026-04 accept novelty 8.0

    AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

  3. OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

    cs.CL 2026-04 unverdicted novelty 8.0

    OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...

  4. TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...

  5. TeamBench: Evaluating Agent Coordination under Enforced Role Separation

    cs.AI 2026-05 unverdicted novelty 7.0

    Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.

  6. Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    A new framework parses and evaluates citations in LLM deep research reports across link validity, relevance, and factuality, finding 94%+ link success but only 39-77% factual accuracy.

  7. Inference-Time Budget Control for LLM Search Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.

  8. PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization

    cs.CR 2026-05 conditional novelty 7.0

    PIIGuard uses optimized hidden HTML fragments on webpages to block LLMs from leaking contact PII via indirect prompt injection, achieving at least 97% defense success across tested models while preserving benign QA utility.

  9. GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    A new workflow for multilingual agent benchmark adaptation using functional, cultural, and difficulty alignments improves non-English agent success rates by up to 32.7% over simple machine translation, indicating subs...

  10. WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

    cs.AI 2026-04 unverdicted novelty 7.0

    WebForge is an automated multi-agent framework that creates realistic and reproducible browser agent benchmarks at scale, demonstrated via a 934-task benchmark that reveals distinct model capability profiles through m...

  11. DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

    cs.AI 2026-04 unverdicted novelty 7.0

    DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human vali...

  12. GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

  13. Evaluating the Search Agent in a Parallel World

    cs.AI 2026-03 unverdicted novelty 7.0

    Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping ...

  14. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  15. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  16. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  17. Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

    cs.MA 2026-05 unverdicted novelty 6.0

    Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.

  18. SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...

  19. DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.

  20. MARCA: A Checklist-Based Benchmark for Multilingual Web Search

    cs.CL 2026-04 accept novelty 6.0

    MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.

  21. Towards Long-horizon Agentic Multimodal Search

    cs.CV 2026-04 unverdicted novelty 6.0

    LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

  22. Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...

  23. Towards Knowledgeable Deep Research: Framework and Benchmark

    cs.AI 2026-04 unverdicted novelty 6.0

    The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

  24. TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.

  25. LightThinker++: From Reasoning Compression to Memory Management

    cs.CL 2026-04 unverdicted novelty 6.0

    LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

  26. Real-Time Execution of Action Chunking Flow Policies

    cs.RO 2025-06 unverdicted novelty 6.0

    Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.

  27. ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

    cs.CV 2026-05 unverdicted novelty 5.0

    ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.

  28. Mind DeepResearch Technical Report

    cs.AI 2026-04 unverdicted novelty 5.0

    MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

  29. AlphaEval: Evaluating Agents in Production

    cs.CL 2026-04 unverdicted novelty 5.0

    AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.

  30. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  31. GLM-5: from Vibe Coding to Agentic Engineering

    cs.LG 2026-02 unverdicted novelty 5.0

    GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

  32. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  33. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  34. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  35. Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

    cs.CL 2026-05 unverdicted novelty 4.0

    This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...

  36. Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

    cs.AI 2026-04 unverdicted novelty 4.0

    A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.

  37. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    cs.CL 2025-08 unverdicted novelty 4.0

    GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.

  38. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.