arxiv: 2511.23281 · v1 · submitted 2025-11-28 · 💻 cs.CL

MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)

Aaron Steiner, Ralph Peeters, Christian Bizer

Pith reviewed 2026-05-17 04:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM agents · web agent interfaces · RAG · MCP · NLWeb · HTML browsing · e-commerce automation · agent evaluation

The pith

RAG, MCP and NLWeb interfaces let LLM web agents outperform direct HTML browsing in accuracy and speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares how LLM agents perform when accessing websites through four different interfaces: raw HTML, retrieval-augmented generation over pre-crawled content, the Model Context Protocol for API communication, and the NLWeb natural language interface. Using a testbed of four simulated e-shops, the authors create specialized agents for each interface and have them carry out tasks ranging from simple product searches to complex checkout processes. Results show that RAG, MCP, and NLWeb agents achieve higher F1 scores and require less time and fewer tokens than HTML agents across multiple language models. This demonstrates that the interface chosen for web interaction has a major influence on agent effectiveness and efficiency.

Core claim

The evaluation on simulated e-shops shows that RAG, MCP, and NLWeb agents outperform HTML agents, with average F1 scores increasing from 0.67 to 0.75-0.77, token usage dropping from 241k to 47k-140k, and runtime falling from 291 seconds to 50-62 seconds per task.
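To put these ranges in relative terms, the arithmetic can be sketched as follows. The inputs are the rounded averages quoted above; the helper name `relative_change` is illustrative and not taken from the paper.

```python
def relative_change(baseline: float, value: float) -> float:
    """Signed relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


# Rounded averages from the core claim: HTML baseline vs. the range
# reported for the RAG, MCP, and NLWeb agents.
f1_html, f1_low, f1_high = 0.67, 0.75, 0.77
tokens_html, tokens_low, tokens_high = 241_000, 47_000, 140_000
runtime_html, runtime_low, runtime_high = 291, 50, 62

# Each tuple is (smallest improvement, largest improvement).
f1_gain = (relative_change(f1_html, f1_low),
           relative_change(f1_html, f1_high))
token_cut = (relative_change(tokens_html, tokens_high),
             relative_change(tokens_html, tokens_low))
runtime_cut = (relative_change(runtime_html, runtime_high),
               relative_change(runtime_html, runtime_low))

print(f"F1:      {f1_gain[0]:+.0%} to {f1_gain[1]:+.0%}")
print(f"Tokens:  {token_cut[0]:+.0%} to {token_cut[1]:+.0%}")
print(f"Runtime: {runtime_cut[0]:+.0%} to {runtime_cut[1]:+.0%}")
```

Under these rounded inputs, the relative F1 gain is roughly 12% to 15%, token usage falls by roughly 42% to 80%, and runtime falls by roughly 79% to 83%.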

What carries the argument

The controlled testbed consisting of four simulated e-shops exposing the same products via HTML, MCP, and NLWeb interfaces, along with interface-specific agents evaluated on identical task sets.
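The idea of running interface-specific agents on identical task sets can be sketched as a small harness. This is a minimal sketch, not the authors' code: `RunResult`, `Agent`, `evaluate`, and `stub` are hypothetical names, and the stub metrics are placeholder values that do not come from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class RunResult:
    f1: float
    tokens: int
    seconds: float


# Hypothetical agent signature: takes a task description, returns one result.
Agent = Callable[[str], RunResult]


def evaluate(agents: Dict[str, Agent], tasks: List[str]) -> Dict[str, RunResult]:
    """Run every agent on the same task list and average its metrics."""
    summary = {}
    for name, agent in agents.items():
        runs = [agent(task) for task in tasks]
        summary[name] = RunResult(
            f1=mean(r.f1 for r in runs),
            tokens=round(mean(r.tokens for r in runs)),
            seconds=mean(r.seconds for r in runs),
        )
    return summary


def stub(f1: float, tokens: int, seconds: float) -> Agent:
    """Stand-in agent that returns constant placeholder metrics."""
    return lambda task: RunResult(f1, tokens, seconds)


# Placeholder agents and tasks, only to exercise the harness.
agents = {"html": stub(0.5, 1000, 10.0), "rag": stub(0.75, 200, 2.0)}
tasks = ["find cheapest offer", "checkout"]
summary = evaluate(agents, tasks)
print(summary)
```

Keeping the task list fixed and varying only the agent is what lets differences in the averaged metrics be attributed to the interface.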

If this is right

RAG, MCP, and NLWeb improve both the success rate and resource efficiency of LLM agents on web-based tasks such as product comparison and checkout.
The best overall configuration is RAG with GPT 5, which reaches an F1 score of 0.87 and a completion rate of 0.79.
When API usage fees are also taken into account, RAG with GPT 5 mini offers a good compromise between cost and performance.
Interface selection can substantially change the practicality of deploying web agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future web agents may shift away from HTML scraping toward structured or natural language interfaces to achieve better results.
Real-world websites could benefit from offering MCP or NLWeb-style access to improve compatibility with AI tools.
These efficiency gains might enable more complex agent workflows that were previously too costly or slow.

Load-bearing premise

The simulated e-shops and their interfaces accurately represent the challenges of real, dynamic websites that may include anti-bot measures or changing content.

What would settle it

Testing the same agents on actual public e-commerce sites to check whether the reported gains in F1 score, token savings, and speed remain consistent.

Figures

Figures reproduced from arXiv: 2511.23281 by Aaron Steiner, Christian Bizer, Ralph Peeters.

**Figure 1.** Overview of the four architectures. What is still missing is a systematic comparison of the effectiveness and efficiency of these architectures using identical sets of challenging tasks [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** Price (log scale) and performance across architecture and model combinations. Top left is best, meaning higher F1 and [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

Original abstract

Large language model agents are increasingly used to automate web tasks such as product search, offer comparison, and checkout. Current research explores different interfaces through which these agents interact with websites, including traditional HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through the NLWeb interface. However, no prior work has compared these four architectures within a single controlled environment using identical tasks. To address this gap, we introduce a testbed consisting of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each interface (HTML, RAG, MCP, and NLWeb) we develop specialized agents that perform the same sets of tasks, ranging from simple product searches and price comparisons to complex queries for complementary or substitute products and checkout processes. We evaluate the agents using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as underlying LLM. Our evaluation shows that the RAG, MCP and NLWeb agents outperform HTML on both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77 for the other agents. Token usage falls from about 241k for HTML to between 47k and 140k per task. The runtime per task drops from 291 seconds to between 50 and 62 seconds. The best overall configuration is RAG with GPT 5 achieving an F1 score of 0.87 and a completion rate of 0.79. Also taking cost into consideration, RAG with GPT 5 mini offers a good compromise between API usage fees and performance. Our experiments show the choice of the interaction interface has a substantial impact on both the effectiveness and efficiency of LLM-based web agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a controlled testbed of four simulated e-shops and compares four interfaces (HTML, RAG, MCP, NLWeb) for LLM web agents. Specialized agents for each interface are evaluated on tasks ranging from simple searches to complex multi-step queries and checkout, using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4. The central empirical claim is that RAG, MCP, and NLWeb outperform raw HTML, raising average F1 from 0.67 to 0.75–0.77, cutting token usage from about 241k to 47k–140k, and reducing runtime from 291 s to 50–62 s per task, with RAG with GPT 5 best overall and RAG with GPT 5 mini a cost-effective option.

Significance. If the quantitative results hold, the work supplies direct, reproducible evidence that interface choice materially affects both effectiveness and efficiency of LLM web agents inside a fixed simulated environment. The use of multiple LLMs, graded task complexity, and explicit cost discussion are strengths. The findings are useful for practitioners selecting agent architectures, though the simulated e-shops limit immediate generalization to live, dynamic sites.

major comments (1)

Evaluation section (and abstract): exact success criteria, agent prompts, and the precise definition of F1 for multi-step tasks (including how partial progress or error recovery is scored) are not fully specified. Without these details the reported F1 deltas cannot be independently verified or compared to other work.

minor comments (2)

The paper should explicitly state the limitations of the simulated e-shops (e.g., absence of anti-bot measures, JavaScript dynamics, or layout drift) so readers understand the scope of the claimed interface advantages.
Table or figure captions should include the exact number of tasks per complexity level and the number of runs per LLM-interface pair to clarify the statistical basis of the averages.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. The quantitative comparison of interfaces for LLM web agents is the core contribution, and we appreciate the emphasis on reproducibility. We address the major comment below.

Point-by-point responses

Referee: Evaluation section (and abstract): exact success criteria, agent prompts, and the precise definition of F1 for multi-step tasks (including how partial progress or error recovery is scored) are not fully specified. Without these details the reported F1 deltas cannot be independently verified or compared to other work.

Authors: We agree that these elements must be specified in full for independent verification. The current manuscript outlines the task categories and reports aggregate F1, token, and runtime metrics but does not include the verbatim agent prompts or the exact F1 computation for multi-step tasks. In the revised version we will expand the Evaluation section with: (1) the complete success criteria (a sub-task is successful if the required information or action is obtained without hallucination); (2) the full system and user prompts used for each of the four interfaces; and (3) the precise F1 definition, where precision and recall are computed over the set of completed sub-goals, partial progress receives proportional credit, and error-recovery steps are not double-penalized once the goal is reached. These additions will make the reported deltas (0.67 to 0.75–0.77) directly reproducible.

revision: yes
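The F1 definition promised in the response, with precision and recall computed over sets of sub-goals, could look roughly like the sketch below. It is an assumption about one reasonable reading of that description, not the paper's scoring code; the function name `subgoal_f1` and the sub-goal identifiers are hypothetical.

```python
from typing import Set, Tuple


def subgoal_f1(expected: Set[str], achieved: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over sets of sub-goal identifiers."""
    if not expected and not achieved:
        # Nothing was required and nothing was done: treat as a perfect score.
        return 1.0, 1.0, 1.0
    true_positives = len(expected & achieved)
    precision = true_positives / len(achieved) if achieved else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical checkout task: four required sub-goals, of which the agent
# completes two and additionally performs one unwanted action.
expected = {"find_product", "add_to_cart", "enter_address", "pay"}
achieved = {"find_product", "add_to_cart", "wrong_item_added"}
precision, recall, f1 = subgoal_f1(expected, achieved)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because achieved sub-goals form a set, a sub-goal reached after a retry counts once, which is one way to avoid penalizing error recovery twice; partial progress shows up as recall below 1 rather than as a zero score.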

Circularity Check

0 steps flagged

No circularity: results are direct experimental measurements

Full rationale

The paper presents an empirical comparison of four web agent interfaces (HTML, RAG, MCP, NLWeb) on a fixed testbed of four simulated e-shops. All reported metrics (F1 scores, token usage, runtime) are obtained by executing specialized agents on identical task sets using specified LLMs and measuring outcomes directly. No equations, derivations, parameter fitting, or self-citations are used to generate the central claims; the results follow from running the experiments as described. The simulation's representativeness is an external validity concern, not a circularity issue in the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical comparison study; central claims rest on the fidelity of the simulated e-shops and the fairness of the specialized agent implementations rather than new theoretical constructs or fitted parameters.

axioms (1)

domain assumption: Simulated e-shops with controlled interfaces accurately reflect relative performance differences that would appear on real websites.
The evaluation uses four simulated shops to isolate interface effects but draws conclusions relevant to practical web agent deployment.

pith-pipeline@v0.9.0 · 5666 in / 1240 out tokens · 34619 ms · 2026-05-17T04:42:12.579775+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Our evaluation shows that the RAG, MCP and NLWeb agents outperform HTML on both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We introduce a testbed consisting of four simulated e-shops...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
cs.LG 2026-03 unverdicted novelty 5.0

The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
