pith. machine review for the scientific record.

arxiv: 2604.06177 · v1 · submitted 2026-02-03 · 💻 cs.IR · cs.AI · cs.CL

Recognition: 2 Lean theorem links

WebExpert: domain-aware web agents with critic-guided expert experience for high-precision search

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:00 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords web agents · domain-aware search · experience retrieval · facet induction · preference optimization · specialized web tasks · GAIA benchmark · web navigation

The pith

WebExpert agents retrieve sentence-level expert experience and induce dynamic facets to raise exact-match accuracy on specialized web searches by 1.5-3.6 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Specialized searches in finance, biomedicine, and similar fields lose precision because agents lack reliable domain knowledge and drift into noisy evidence. WebExpert counters this by pulling relevant past sentences, merging topics, distilling rules, and bootstrapping facets such as time, region, and industry from weak labels instead of fixed dictionaries. Preference learning then refines both planning and retrieval in one step, with a light gate that favors useful facets at inference. The result on GAIA, GPQA, HLE, and WebWalkerQA is higher answer exact match and fewer page visits than strong browsing baselines. A sympathetic reader would care because the method shows how lightweight, learned priors can make agents more effective on real domain tasks without heavy manual engineering.

Core claim

WebExpert equips web agents with sentence-level experience retrieval that merges topics and distills rules, schema-light facet induction that learns time, region, policy, and industry facets from weak supervision, and preference-optimized planning that improves query planning and retrieval together through pairwise preferences and a coverage objective. At inference, a lightweight experience gate biases decoding toward active facets, falling back to ungated decoding when retrieval confidence is low. Together these pieces produce 1.5-3.6 point gains in answer exact match and fewer page hops on GAIA, GPQA, HLE, and WebWalkerQA.
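The gate behavior described here — bias decoding toward active facets, fall back when retrieval confidence is low — can be sketched in a few lines. Everything below (function names, the bias and threshold values) is an illustrative assumption, not the paper's implementation.

```python
# Hypothetical sketch of an experience gate: boost candidates that match
# active facets, and fall back to the raw scores when retrieval confidence
# is low. The bias and threshold values are invented for illustration.

def gate_scores(candidate_scores, active_facets, retrieval_confidence,
                bias=0.5, threshold=0.3):
    """Return a candidate->score map, boosted for facet-matching candidates.

    candidate_scores: dict mapping candidate strings to base scores.
    active_facets: set of facet terms (e.g. {"2023", "EU", "pharma"}).
    retrieval_confidence: float in [0, 1]; below `threshold` the gate
    returns the unbiased scores (the fallback path).
    """
    if retrieval_confidence < threshold:
        return dict(candidate_scores)  # fallback: no facet biasing
    biased = {}
    for cand, score in candidate_scores.items():
        hit = any(f.lower() in cand.lower() for f in active_facets)
        biased[cand] = score + (bias if hit else 0.0)
    return biased

scores = {"revenue 2023 EU filings": 1.0, "company history": 1.2}
gated = gate_scores(scores, {"2023", "EU"}, retrieval_confidence=0.8)
# the facet-matching candidate now outranks the generic one
```

With confident retrieval, the facet-matching candidate overtakes the generically higher-scored one; with low confidence the gate is a no-op, which is the fallback the core claim describes.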

What carries the argument

Critic-guided expert experience retrieval paired with schema-light facet induction that supplies dynamic domain priors to preference-optimized planning.

If this is right

  • Agents achieve higher exact-match accuracy on finance, biomedicine, and pharmaceutical queries without static domain lexicons.
  • Planning and retrieval improve jointly when trained with pairwise preference signals and a coverage objective.
  • Fewer page hops result from biasing decoding toward retrieved facets at inference time.
  • Ablations confirm that retrieval, topic merging, facet induction, and preference training each contribute to the gains.
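The second bullet — joint improvement via pairwise preference signals and a coverage objective — can be made concrete with a DPO-style pairwise loss plus a simple coverage term. This is a hedged sketch of that class of objective, not the paper's formulation; the coverage term in particular is invented for illustration.

```python
import math

# DPO-style pairwise preference loss on one (chosen, rejected) plan pair,
# plus a toy coverage-aware term penalizing facets the plan fails to cover.

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * implicit-reward margin) over a preference pair."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def coverage_penalty(retrieved_facets, required_facets, weight=1.0):
    """Fraction of required facets the plan leaves uncovered, scaled by weight."""
    missing = required_facets - retrieved_facets
    return weight * len(missing) / max(len(required_facets), 1)

# A plan that assigns higher probability to the preferred query (relative to
# the reference policy) and covers more facets gets a lower total loss.
total = dpo_loss(-1.0, -3.0, -2.0, -2.5) \
        + coverage_penalty({"time", "region"}, {"time", "region", "industry"})
```

The loss falls when the policy prefers the chosen plan more strongly than the reference does, which is the mechanism by which planning and retrieval preferences can be trained with the same pairwise signal.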

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same experience-retrieval and facet-induction loop could be added to non-web agents that face domain-specific evidence.
  • If weak-supervision facet induction generalizes, it lowers the cost of building agents for narrow verticals.
  • Real-world noisy queries may benefit more than benchmarks once the experience gate learns reliable fallback thresholds.
  • The method suggests experience modules can act as reusable, updatable domain memory across successive agent runs.

Load-bearing premise

Sentence-level experience retrieval with topic merging, rule distillation, and facet induction from weak supervision can reliably provide useful domain priors without causing query drift or needing hand-written lexicons.
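A minimal stand-in for this premise: embed experience sentences, greedily merge near-duplicates into topics, and retrieve the top-k topics for a query. Bag-of-words cosine replaces the neural embeddings a real system would use, and every threshold is illustrative.

```python
from collections import Counter
import math

def vec(text):
    """Bag-of-words vector; a stand-in for a sentence embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_topics(sentences, sim_threshold=0.6):
    """Greedy topic merging: fold each sentence into the first topic it matches."""
    topics = []  # list of (representative_vector, member_sentences)
    for s in sentences:
        v = vec(s)
        for rep, members in topics:
            if cosine(v, rep) >= sim_threshold:
                members.append(s)
                break
        else:
            topics.append((v, [s]))
    return topics

def retrieve(query, topics, k=2):
    """Return one representative sentence from each of the top-k topics."""
    qv = vec(query)
    ranked = sorted(topics, key=lambda t: cosine(qv, t[0]), reverse=True)
    return [members[0] for _, members in ranked[:k]]

rules = [
    "check the filing date before citing revenue",
    "verify filing date before citing revenue figures",  # merges with above
    "prefer regulator sites for clinical guidance",
]
topics = merge_topics(rules)   # two topics after merging near-duplicates
hits = retrieve("what was the revenue filing date", topics, k=1)
```

The premise holds only if retrieval like this surfaces on-topic rules without dragging the query off course; the drift question the referee raises below is exactly about that failure mode.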

What would settle it

Test WebExpert on a new specialized-domain benchmark where no similar past experiences exist; if exact-match gains disappear or page hops rise, the experience components are not supplying reliable priors.

read the original abstract

Specialized web tasks in finance, biomedicine, and pharmaceuticals remain challenging due to missing domain priors: queries drift, evidence is noisy, and reasoning is brittle. We present WebExpert, a domain-aware web agent that we implement end-to-end, featuring: (i) sentence-level experience retrieval with topic merging and rule distillation, (ii) schema-light facet induction that bootstraps time, region, policy, and industry facets from weak supervision instead of static hand-written lexicons, and (iii) preference-optimized planning that jointly improves query planning and retrieval via pairwise preference learning alongside a coverage-aware objective. At inference, a lightweight experience gate biases decoding toward active facets with fallback under low retrieval confidence. On GAIA, GPQA, HLE, and WebWalkerQA, WebExpert improves Answer Exact Match (EM) by 1.5-3.6 pp over the strongest browsing baseline and reduces page hops. Analysis shows consistent gains, with ablations on retrieval, topic merging, facet induction, and preference-aware training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents WebExpert, an end-to-end domain-aware web agent for specialized search tasks. It introduces sentence-level experience retrieval with topic merging and rule distillation, schema-light facet induction from weak supervision (bootstrapping time/region/policy/industry facets), and preference-optimized planning that combines pairwise preference learning with a coverage-aware objective. An experience gate biases decoding at inference. On GAIA, GPQA, HLE, and WebWalkerQA the method reports 1.5-3.6 pp gains in Answer Exact Match over the strongest browsing baseline together with fewer page hops; ablations on retrieval, merging, facet induction, and preference training are provided.

Significance. If the gains hold under rigorous testing, the approach would offer a practical route to injecting domain priors into web agents without static lexicons, potentially improving precision on noisy specialized queries. The joint preference optimization of planning and retrieval and the critic-guided experience mechanism are technically distinctive and could influence subsequent agent work. The current evidence, however, is limited to general benchmarks and small absolute improvements, so the assessed significance remains moderate.

major comments (3)
  1. [Abstract] Abstract and Evaluation section: the motivating claim is that the components supply reliable domain priors for finance/biomedicine/pharmaceutical queries without causing drift, yet all reported results are confined to GAIA, GPQA, HLE, and WebWalkerQA; no experiments on domain-specific queries from the motivating areas are described, leaving the central assumption unverified.
  2. [Results] Results section: the 1.5-3.6 pp EM gains are presented without error bars, baseline implementation details, or statistical significance tests; given the modest absolute margins, it is unclear whether the improvements are robust or sensitive to post-hoc choices and data splits.
  3. [Method] Method section on schema-light facet induction and experience gate: these mechanisms are asserted to avoid query drift while supplying domain priors, but the manuscript contains no direct measurement (e.g., drift rate or precision on specialized queries) that would substantiate the claim under the conditions where drift risk is highest.
minor comments (2)
  1. [Abstract] The abstract states 'consistent gains and ablations' but does not quantify the magnitude of the ablation drops or identify which component contributes most to the reported gains.
  2. Notation for the induced facets (time, region, policy, industry) and the experience gate threshold would benefit from a concrete example or pseudocode in the main text.
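To illustrate what the requested pseudocode might look like, here is a hypothetical sketch of schema-light facet induction from weak supervision: seed patterns act as weak labelers, and tokens that co-occur with a seed's matches are promoted into that facet's learned lexicon. The seeds, facet names, and threshold are invented for the example, not taken from the paper.

```python
import re
from collections import defaultdict

# Weak labelers: a seed pattern per facet stands in for weak supervision.
SEEDS = {
    "time": re.compile(r"\b(19|20)\d{2}\b|quarter|fiscal"),
    "region": re.compile(r"\b(EU|US|APAC|Europe|Asia)\b"),
    "industry": re.compile(r"pharma|biotech|banking|semiconductor"),
}

def induce_facets(queries, min_count=2):
    """Promote tokens co-occurring with seed matches into facet lexicons.

    Returns {facet: set_of_tokens} for tokens seen at least `min_count`
    times in queries that triggered that facet's seed pattern.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for q in queries:
        for facet, pattern in SEEDS.items():
            if pattern.search(q):
                for tok in q.lower().split():
                    counts[facet][tok] += 1
    return {facet: {t for t, c in toks.items() if c >= min_count}
            for facet, toks in counts.items()}

lexicons = induce_facets([
    "2023 EU pharma approvals",
    "2022 EU pharma trials",
])
# "eu" and "pharma" recur alongside the seeds, so they are promoted;
# one-off tokens like "approvals" are not.
```

The promoted tokens are noisy (a recurring region term leaks into the time facet, for instance), which is exactly the weak-supervision trade-off the manuscript should quantify.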

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where appropriate to strengthen the presentation of results and claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Evaluation section: the motivating claim is that the components supply reliable domain priors for finance/biomedicine/pharmaceutical queries without causing drift, yet all reported results are confined to GAIA, GPQA, HLE, and WebWalkerQA; no experiments on domain-specific queries from the motivating areas are described, leaving the central assumption unverified.

    Authors: We acknowledge that the primary quantitative results use established general-purpose benchmarks. These benchmarks contain a substantial fraction of queries that require specialized domain knowledge (e.g., scientific, technical, and policy-related questions in GAIA and GPQA). In the revision we will (i) add an explicit mapping in the evaluation section showing which benchmark subsets align with the motivating domains, (ii) include qualitative examples illustrating facet induction and experience retrieval on such queries, and (iii) report a small-scale drift analysis on a held-out set of domain-flavored queries. We therefore mark this as a partial revision focused on clarification and supporting analysis rather than entirely new large-scale experiments. revision: partial

  2. Referee: [Results] Results section: the 1.5-3.6 pp EM gains are presented without error bars, baseline implementation details, or statistical significance tests; given the modest absolute margins, it is unclear whether the improvements are robust or sensitive to post-hoc choices and data splits.

    Authors: We agree that the current presentation lacks sufficient statistical detail. In the revised results section and appendix we will add (i) error bars computed over at least three independent runs with different random seeds, (ii) complete hyper-parameter tables and implementation notes for every baseline, and (iii) statistical significance tests (paired t-test and bootstrap confidence intervals) on the Exact Match differences. These additions will directly address concerns about robustness and sensitivity to data splits. revision: yes

  3. Referee: [Method] Method section on schema-light facet induction and experience gate: these mechanisms are asserted to avoid query drift while supplying domain priors, but the manuscript contains no direct measurement (e.g., drift rate or precision on specialized queries) that would substantiate the claim under the conditions where drift risk is highest.

    Authors: We accept that direct quantitative evidence of drift mitigation is currently missing. We will add a new analysis subsection that defines and reports a drift metric (cosine similarity between original and expanded query embeddings plus manual annotation of drift cases) for both the schema-light facet induction and the experience gate. We will also evaluate precision on a small curated set of finance- and biomedicine-style queries to provide the requested substantiation under high-drift-risk conditions. revision: yes
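The proposed drift metric is easy to prototype. The sketch below uses bag-of-words cosine as a stand-in for the query embeddings the authors would actually use, and the drift threshold is illustrative, not theirs.

```python
from collections import Counter
import math

def bow_cosine(a, b):
    """Cosine similarity over bag-of-words vectors (embedding stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb.get(t, 0) for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def drift_rate(pairs, threshold=0.5):
    """Fraction of (original, expanded) query pairs whose similarity falls
    below `threshold`, i.e. pairs flagged as having drifted."""
    flags = [bow_cosine(orig, expanded) < threshold for orig, expanded in pairs]
    return sum(flags) / len(flags)

pairs = [
    ("ema approval 2023", "ema approval 2023 EU pharma"),  # faithful expansion
    ("ema approval 2023", "celebrity news today"),         # drifted expansion
]
rate = drift_rate(pairs)  # one of two pairs flagged
```

A rising drift rate on high-risk domain queries would falsify the no-drift claim the referee targets; a flat one would support it.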

Circularity Check

0 steps flagged

No significant circularity; the results are empirical measurements on public benchmarks.

full rationale

The paper describes an implemented agent system using sentence-level experience retrieval with topic merging and rule distillation, schema-light facet induction from weak supervision, and preference-optimized planning with an experience gate at inference. It reports empirical Answer Exact Match gains of 1.5-3.6 pp on GAIA, GPQA, HLE, and WebWalkerQA versus browsing baselines, plus reduced page hops, with ablations on the components. No equations, derivations, or self-citations appear in the provided text that reduce any claim to fitted parameters or inputs by construction. The central results are framed as measured improvements from the described techniques rather than self-referential predictions or renamed fits.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard assumptions about LLM planning and retrieval capabilities plus the effectiveness of weak supervision for facet induction; no new entities are postulated and no explicit free parameters are named in the abstract.

axioms (2)
  • domain assumption LLMs can perform effective sentence-level retrieval and topic merging when given appropriate prompts
    Invoked by the experience retrieval and planning components
  • domain assumption Weak supervision signals suffice to induce useful time/region/policy/industry facets without hand-crafted lexicons
    Central to the schema-light facet induction claim

pith-pipeline@v0.9.0 · 5504 in / 1388 out tokens · 36935 ms · 2026-05-16T08:00:42.836534+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 5 internal anchors

  1. [1]

    Without expert priors, agents formulate off-target queries, wander to irrelevant pages, and miss evidence

    INTRODUCTION Web browsing agents have shown strong results on open-ended tasks, yet their effectiveness drops in domain-specific scenarios (e.g., credit approval in finance, clinical guidance in biomedicine). Without expert priors, agents formulate off-target queries, wander to irrelevant pages, and miss evidence. In practice, domain practitioners atten...

  2. [2]

    WebExpert: domain-aware web agents with critic-guided expert experience for high-precision search

    RELATED WORK Web agents and deep research. Large reasoning models (LRMs) integrated with search and browsing have shown strong capabilities in complex tasks. Reason-then-search systems (e.g., search-o1 [1]) couple agentic retrieval with in-document reasoning to iteratively refine external knowl... Fig. 1. Overview of We...

  3. [3]

    CFA Institute

    METHOD 3.1. Problem Setup We consider domain-specific web tasks where an agent must generate search queries, browse the web, and synthesize an answer a for a question q. We assume access to a curated expert experience base E = {r_i}_{i=1}^N of sentence-level rules distilled from expert corpora. Let E^(k) denote the top-k retrieved experiences for q (see Sec. 3.3). T...

  4. [4]

    Setup Datasets.We evaluate on GAIA, GPQA, HLE

    EXPERIMENTS 4.1. Setup Datasets. We evaluate on GAIA, GPQA, HLE. Each includes open-domain and domain-focused subsets; we report overall results. We additionally evaluate on WebWalkerQA, a benchmark for multi-step web browsing and grounded question answering. WebWalkerQA includes hundreds of tasks across real-world domains and requires page navigation ...

  5. [5]

    ACKNOWLEDGMENT This work was partly supported by the NSFC (62431015, 62571317, 62501387), the Fundamental Research Funds for the Central Universities, Shanghai Key Laboratory of Digital Media Processing and Transmission under Grant 22DZ2229005, 111 project BP0719010

  6. [6]

    Experiments on GAIA, GPQA, and HLE show consistent 1.5–3.6 pp gains and improved efficiency

    CONCLUSION We proposed WebExpert, a critic-guided, domain-aware web agent that retrieves expert experiences to ground query generation before deep browsing. Experiments on GAIA, GPQA, and HLE show consistent 1.5–3.6 pp gains and improved efficiency. Our analysis highlights the importance of sentence-level retrieval, topic merging, and SFT for domain fidelity

  7. [7]

    Search-o1: Reason-then-search web agents,

    OpenAI, “Search-o1: Reason-then-search web agents,” 2024. arXiv preprint

  8. [8]

    WebThinker: Deep browsing with iterative query planning,

    L. Wen et al., “WebThinker: Deep browsing with iterative query planning,” 2025. arXiv preprint

  9. [9]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model,

    R. Rafailov, A. Kumar, T. Xiao, et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model,” NeurIPS, 2023

  10. [10]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP,

    P. Lewis, E. Perez, A. Karpukhin, et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP,” NeurIPS, 2020

  11. [11]

    BERTopic: Neural topic modeling with a class-based TF-IDF procedure

    M. Grootendorst, “BERTopic: Neural topic modeling with a class-based TF-IDF procedure,” arXiv preprint arXiv:2203.05794, 2022

  12. [12]

    YaRN: Efficient Context Window Extension of Large Language Models

    BAAI, “BGE/FlagEmbedding: Bilingual General Embeddings,” arXiv preprint arXiv:2309.00071, 2023

  13. [13]

    GPQA: A graduate-level Google-proof QA benchmark,

    M. Rein et al., “GPQA: A graduate-level Google-proof QA benchmark,” arXiv preprint arXiv:2306.12345, 2023

  14. [14]

    GAIA: General AI Assistant benchmark for web tasks,

    S. Kumar et al., “GAIA: General AI Assistant benchmark for web tasks,” arXiv preprint arXiv:2401.22222, 2024

  15. [15]

    Humanity's Last Exam

    L. Phan, A. Gatti, Z. Han, et al., “Humanity’s Last Exam,” arXiv preprint arXiv:2501.14249, 2025

  16. [16]

    QwQ-32B: A strong reasoning model,

    Qwen Team, “QwQ-32B: A strong reasoning model,” Technical Report, 2024

  17. [17]

    ReAct: Synergizing reasoning and acting in language models,

    Y. Yao, D. Zhao, S. Yang, et al., “ReAct: Synergizing reasoning and acting in language models,” NeurIPS, 2023

  18. [18]

    Billion-scale similarity search with GPUs,

    J. Johnson, M. Douze, H. Jégou, “Billion-scale similarity search with GPUs,” IEEE TPAMI, 2019

  19. [19]

    Efficient and robust approximate nearest neighbor search using HNSW,

    Y. Malkov, D. Yashunin, “Efficient and robust approximate nearest neighbor search using HNSW,” IEEE TPAMI, 2020

  20. [20]

    The probabilistic relevance framework: BM25 and beyond,

    S. Robertson, H. Zaragoza, “The probabilistic relevance framework: BM25 and beyond,” Found. Trends IR, 2009

  21. [21]

    Dense Passage Retrieval for Open-Domain Question Answering,

    V. Karpukhin, B. Oguz, S. Min, et al., “Dense Passage Retrieval for Open-Domain Question Answering,” EMNLP, 2020

  22. [22]

    Self-RAG: Learning to retrieve, generate, and critique,

    S. Asai, K. Hashimoto, R. Socher, et al., “Self-RAG: Learning to retrieve, generate, and critique,” ICLR, 2024

  23. [23]

    UniGen: Unified retrieval-augmented generation,

    S. Shi, X. Yao, M. Yu, et al., “UniGen: Unified retrieval-augmented generation,” ICML, 2024

  24. [24]

    Improving language models by retrieving from trillions of tokens,

    X. Borgeaud, A. Mensch, J. Hoffmann, et al., “Improving language models by retrieving from trillions of tokens,” ICML, 2022

  25. [25]

    Visualizing data using t-SNE,

    L. van der Maaten, G. Hinton, “Visualizing data using t-SNE,” JMLR, 2008

  26. [26]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    L. McInnes, J. Healy, J. Melville, “UMAP: Uniform Manifold Approximation and Projection,” arXiv:1802.03426, 2018

  27. [27]

    Density-based clustering based on hierarchical density estimates,

    R. Campello, D. Moulavi, J. Sander, “Density-based clustering based on hierarchical density estimates,” PAKDD, 2013

  28. [28]

    The use of MMR, diversity-based reranking for reordering documents and producing summaries,

    J. Carbonell, J. Goldstein, “The use of MMR, diversity-based reranking for reordering documents and producing summaries,” SIGIR, 1998

  29. [29]

    Retrieval-Augmented Generation for Large Language Models: A Survey,

    W. Zhao, Z. Chen, Y. Xiong, et al., “Retrieval-Augmented Generation for Large Language Models: A Survey,” 2024

  30. [30]

    Meta-RAG: Memory Enhanced Retrieval-Augmented Generation,

    Y. Jin, R. Li, D. Sachan, et al., “Meta-RAG: Memory Enhanced Retrieval-Augmented Generation,” ACL, 2024

  31. [31]

    Pai-Megatron-Patch: Megatron-LM compatible training framework,

    Pai-Megatron-Patch Team, “Pai-Megatron-Patch: Megatron-LM compatible training framework,” WeChat Official Account technical report, 2024