arxiv: 2508.13024 · v3 · submitted 2025-08-18 · 💻 cs.CL

WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

Pith reviewed 2026-05-18 22:26 UTC · model grok-4.3

classification 💻 cs.CL
keywords web agents · e-commerce benchmark · multi-shop evaluation · comparison shopping · LLM agents · offline simulation · task completion · retrieval challenges

The pith

WebMall introduces the first offline benchmark simulating multiple shops to test web agents on complex comparison shopping tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WebMall to fill a gap in existing benchmarks, which are limited to single shops or hard-to-reproduce online settings. It creates four simulated shops with varied product information and defines tasks that require agents to search across shops, compare prices, locate substitutes or complements, and complete purchases. Testing eight agents that vary in how they observe the environment, use memory, and select their language model reveals low success rates. The strongest agents finish fewer than 65 percent of the tasks in the cheapest-product and vague-search categories. This setup allows exact reproduction of experiments while exposing current limits in handling realistic multi-shop retrieval.

Core claim

WebMall consists of four simulated shops populated with product data extracted from the web. The tasks range from specific product searches and price comparisons to advanced searches for complementary or substitute products, as well as checkout processes. When validated using eight agents that differ in observation space, availability of short-term memory, and the employed language model, the best-performing agents achieve task completion rates below 65 percent in the cheapest product search and vague product search categories.
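
To make the cheapest-product category concrete, the sketch below shows what resolving such a task involves once offers from all shops have been collected: filter to the requested product, then keep every offer that ties for the lowest price. The `Offer` type, the shop names, the product IDs, and the prices are all hypothetical illustrations; none of them come from the WebMall catalogs, and the paper's actual success criterion may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One listing of a product in one shop (hypothetical schema)."""
    shop: str
    product_id: str
    title: str
    price: float


def cheapest_offers(offers, product_id):
    """Return every offer for product_id that ties for the lowest price."""
    matching = [o for o in offers if o.product_id == product_id]
    if not matching:
        return []
    lowest = min(o.price for o in matching)
    return [o for o in matching if o.price == lowest]


# Invented example data: the same product listed under different titles
# and prices in three of four shops.
OFFERS = [
    Offer("shop_a", "kb-100", "Mechanical Keyboard KB-100", 79.99),
    Offer("shop_b", "kb-100", "KB100 mech. keyboard (black)", 74.50),
    Offer("shop_c", "kb-100", "Keyboard KB-100", 74.50),
    Offer("shop_d", "ms-20", "Wireless Mouse MS-20", 19.99),
]

best = cheapest_offers(OFFERS, "kb-100")
print(sorted(o.shop for o in best))
```

The comparison itself is trivial; the difficulty the benchmark targets lies upstream, in getting an agent to find the matching listings in every shop despite differing titles and page layouts.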

What carries the argument

The WebMall benchmark, a set of four simulated shops and a suite of comparison-shopping tasks that force agents to retrieve and compare information across heterogeneous sources.

If this is right

  • Agents must develop better strategies for switching between shops and comparing results rather than treating each shop in isolation.
  • Reproducible offline environments become feasible for testing long sequences that include search, comparison, and checkout.
  • Low completion rates in the hardest categories point to specific weaknesses in current observation and memory mechanisms.
  • The benchmark supplies a fixed testbed that can be used to measure incremental gains as new agent designs are introduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success on WebMall could serve as a stepping stone toward agents that handle everyday consumer decisions across competing retailers without human oversight.
  • The same simulation approach could be extended to other domains that require cross-site comparison, such as booking travel or gathering service quotes.
  • If agents improve on these tasks, the next question becomes whether the same improvements transfer when the underlying product data and site layouts change over time.

Load-bearing premise

The product data and navigation structure in the four simulated shops are representative of the variety and retrieval challenges that appear when agents interact with multiple real online stores.

What would settle it

Running the same eight agents on live versions of several actual e-commerce sites and comparing their task completion rates with those reported on WebMall would settle it: substantially higher rates on live sites would indicate that the simulation overstates real-world difficulty, and substantially lower rates that it understates it.

Figures

Figures reproduced from arXiv: 2508.13024 by Aaron Steiner, Christian Bizer, Julian Yuya Caspary, Luca Schwarz, Ralph Peeters.

Figure 1: Product detail page (left) and checkout page (right).
Figure 2: Cost versus task completion rate for the basic (left) and advanced (right) task set.

Original abstract

LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the user's needs. Benchmarks for evaluating web agents either require agents to perform tasks online using the live Web or offline using simulated environments, the latter allowing for the exact reproduction of the experimental setup. While DeepShop and ShoppingComp provide online benchmarks that require agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, and Mind2Web cover only comparatively simple e-commerce tasks performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex retrieval tasks. We fill this gap by introducing WebMall, the first offline multi-shop benchmark for evaluating web agents on challenging comparison shopping tasks. WebMall consists of four simulated shops populated with product data extracted from the Common Crawl. The WebMall tasks range from specific product searches and price comparisons to advanced searches for complementary or substitute products, as well as checkout processes. We validate WebMall using eight agents that differ in observation space, availability of short-term memory, and the employed LLM. The validation highlights the difficulty of the benchmark, with the best-performing agents achieving task completion rates below 65% in the task categories cheapest product search and vague product search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces WebMall as the first offline multi-shop benchmark for web agents, consisting of four simulated shops populated with product data from Common Crawl. It defines tasks ranging from specific product searches and price comparisons to searches for complementary or substitute products and checkout processes. Validation experiments with eight agents (varying in observation space, short-term memory, and LLM) report task completion rates below 65% for the best agents on cheapest-product-search and vague-product-search categories, positioning WebMall as more challenging than prior single-shop offline benchmarks such as WebShop, WebArena, and Mind2Web.

Significance. If the four shops exhibit genuine heterogeneity in product overlap, pricing, and textual descriptions, WebMall would address a clear gap by enabling reproducible evaluation of cross-shop retrieval and comparison tasks. The offline simulation design is a strength for exact reproducibility. The reported low success rates on key categories provide preliminary evidence of difficulty, but the overall significance remains conditional on quantitative validation that the observed failures arise from intended multi-shop complexity rather than data artifacts.

major comments (2)
  1. The central claim that WebMall is meaningfully harder than single-shop benchmarks rests on the four shops containing overlapping but non-identical items with varied pricing and distinct descriptions. However, the manuscript provides no quantitative characterization (e.g., inter-shop overlap percentages, price dispersion statistics, attribute completeness rates, or lexical diversity metrics) of the Common Crawl-derived data. Without these, the <65% success rates on cheapest-product and vague-product tasks cannot be confidently attributed to multi-shop retrieval challenges rather than simulation artifacts such as uniform schemas or low textual variation.
  2. The validation section and abstract supply limited detail on exact task definitions, success criteria (e.g., precise matching rules for 'cheapest' or 'vague' queries), number of tasks per category, and statistical analysis (e.g., confidence intervals or variance across runs). This weakens the evidential support for the headline difficulty result and makes it hard to interpret the performance differences across the eight agents.
minor comments (3)
  1. Clarify the precise observation spaces (e.g., HTML vs. DOM vs. screenshot) and memory mechanisms used by each of the eight agents, ideally in a dedicated table.
  2. Add example task instances and ground-truth success conditions for each category to improve reproducibility.
  3. Ensure consistent terminology between 'vague product search' in the abstract and any corresponding section heading or definition in the main text.
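
The data characterization requested in major comment 1 could be computed along the following lines. This is a minimal sketch under stated assumptions: Jaccard similarity is used as one possible overlap measure, the coefficient of variation as one possible price-dispersion measure, and a whitespace-tokenized type-token ratio as one possible lexical diversity measure. The function names and the input shapes are hypothetical; the report does not prescribe specific formulas.

```python
from itertools import combinations
from statistics import mean, pstdev


def jaccard(a, b):
    """Jaccard similarity of two collections of product IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(catalogs):
    """Map each shop pair to the Jaccard overlap of their product IDs.

    catalogs: dict of shop name -> set of product IDs (assumed shape).
    """
    return {
        (s, t): jaccard(catalogs[s], catalogs[t])
        for s, t in combinations(sorted(catalogs), 2)
    }


def price_dispersion(prices):
    """Coefficient of variation (population std / mean) for one product."""
    m = mean(prices)
    return pstdev(prices) / m if m else 0.0


def type_token_ratio(text):
    """Distinct tokens divided by total tokens, using whitespace tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented toy inputs, for illustration only.
catalogs = {"shop_a": {"p1", "p2", "p3"}, "shop_b": {"p2", "p3", "p4"}}
print(pairwise_overlap(catalogs))
print(price_dispersion([8.0, 12.0]))
print(type_token_ratio("red shoe red shoe"))
```

Overlap computed on product IDs presupposes that matched products across shops have already been identified; on raw crawled data that matching step is itself non-trivial and is not covered here.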

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that additional quantitative characterization of the product data and expanded details on tasks and evaluation will strengthen the manuscript, and we will incorporate these in the revision.

Point-by-point responses
  1. Referee: The central claim that WebMall is meaningfully harder than single-shop benchmarks rests on the four shops containing overlapping but non-identical items with varied pricing and distinct descriptions. However, the manuscript provides no quantitative characterization (e.g., inter-shop overlap percentages, price dispersion statistics, attribute completeness rates, or lexical diversity metrics) of the Common Crawl-derived data. Without these, the <65% success rates on cheapest-product and vague-product tasks cannot be confidently attributed to multi-shop retrieval challenges rather than simulation artifacts such as uniform schemas or low textual variation.

    Authors: We agree that quantitative metrics would help substantiate that the observed difficulties arise from multi-shop heterogeneity rather than data artifacts. In the revised manuscript we will add a dedicated subsection (likely in Section 3 or 4) reporting inter-shop product overlap percentages, price dispersion for matched items, attribute completeness rates, and lexical diversity metrics (e.g., type-token ratio or embedding variance) computed on the Common Crawl-derived catalogs. These statistics will directly support the claim that WebMall introduces genuine cross-shop comparison challenges. revision: yes

  2. Referee: The validation section and abstract supply limited detail on exact task definitions, success criteria (e.g., precise matching rules for 'cheapest' or 'vague' queries), number of tasks per category, and statistical analysis (e.g., confidence intervals or variance across runs). This weakens the evidential support for the headline difficulty result and makes it hard to interpret the performance differences across the eight agents.

    Authors: We acknowledge that greater precision is needed. The revision will expand the task definitions in Section 4 and the experimental setup in Section 5 to include: (i) verbatim task templates and query examples for each category, (ii) explicit success criteria with matching rules (e.g., exact product ID match for cheapest-product tasks and semantic similarity thresholds for vague-product tasks), (iii) the exact number of tasks per category, and (iv) statistical analysis including standard deviations and 95% confidence intervals computed over multiple runs. This will improve reproducibility and allow clearer comparison across agents. revision: yes
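
The statistical reporting promised in the second response could look roughly like the sketch below: a per-category completion rate averaged over runs, the sample standard deviation across runs, and a 95% interval. The Wilson score interval is used here as one reasonable choice for a binomial proportion; the rebuttal does not say which interval the authors will use. The function names and the input shape (a list of runs, each a list of per-task booleans) are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def summarize_runs(run_outcomes):
    """Summarize repeated runs of one task category.

    run_outcomes: list of runs, each a list of booleans (task completed or not).
    """
    rates = [sum(run) / len(run) for run in run_outcomes]
    successes = sum(sum(run) for run in run_outcomes)
    trials = sum(len(run) for run in run_outcomes)
    return {
        "mean_rate": mean(rates),
        # Sample standard deviation across runs; undefined for a single run.
        "std": stdev(rates) if len(rates) > 1 else 0.0,
        # Pools all attempts, which treats repeated attempts as independent.
        "ci95": wilson_interval(successes, trials),
    }


# Invented outcomes: two runs over the same two tasks.
print(summarize_runs([[True, False], [True, True]]))
```

Pooling attempts across runs ignores the correlation between repeated attempts at the same task, so the interval is likely too narrow when per-task difficulty varies a lot; a per-task bootstrap would be a more careful alternative.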

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction without derivations or self-referential predictions

full rationale

This is a benchmark construction paper that describes the creation of WebMall from Common Crawl product data across four simulated shops and reports empirical task completion rates for eight agents. No equations, fitted parameters, uniqueness theorems, or predictions are present that could reduce to inputs by construction. The central claims rest on the explicit construction process and observed performance numbers rather than any self-definitional loop, self-citation load-bearing argument, or renamed known result. The absence of a derivation chain makes circularity analysis inapplicable; the paper is self-contained against its stated empirical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on domain assumptions about data representativeness and simulation fidelity rather than new mathematical parameters or invented entities.

axioms (1)
  • domain assumption Product data extracted from Common Crawl is sufficiently representative of real e-commerce heterogeneity across multiple shops.
    This assumption underpins the construction of the four simulated shops and the realism of retrieval tasks.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)

    cs.CL 2025-11 accept novelty 7.0

    RAG, MCP, and NLWeb interfaces let LLM web agents achieve higher F1 scores (0.75-0.77 vs 0.67) and much lower token usage and runtime than HTML in controlled e-commerce tasks.
