ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, Xiaoyi Zeng · 2025 · arXiv 2508.04266

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

cs.IR · 2026-05-11 · unverdicted · novelty 7.0

RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.

MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)

cs.CL · 2025-11-28 · accept · novelty 7.0

RAG, MCP, and NLWeb interfaces let LLM web agents achieve higher F1 scores (0.75-0.77 vs 0.67) and much lower token usage and runtime than HTML in controlled e-commerce tasks.

WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

cs.CL · 2025-08-18 · conditional · novelty 7.0

WebMall is the first offline multi-shop benchmark for evaluating LLM web agents on complex comparison shopping tasks across heterogeneous product data from multiple simulated e-shops.

citing papers explorer

Showing 3 of 3 citing papers.

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents · cs.IR · 2026-05-11 · unverdicted · none · ref 10
MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report) · cs.CL · 2025-11-28 · accept · none · ref 12
WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents · cs.CL · 2025-08-18 · conditional · none · ref 20
