arxiv: 2603.14864 · v2 · pith:L6C4LIYUnew · submitted 2026-03-16 · 💻 cs.CL

Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks

classification 💻 cs.CL
keywords shopping, tasks, benchmark, preference, agent, companion, design, e-commerce

In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budget management, and bundle deals, where accurately capturing user preferences from long-horizon conversations is critical. However, progress is limited by two key challenges: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of fine-grained supervision for shopping agent training. To fill the benchmark gap, we introduce Shopping Companion Bench, a novel benchmark comprising two shopping tasks that require cross-session preference memory, grounded in a product pool of over 1.2 million real-world items. Our analysis further identifies two major sources of failure on this benchmark: cascading errors caused by preference hallucination, and insufficient verification of product attributes against user requirements. To address these failure modes, we design annotation-free, tool-wise rewards that provide process supervision for each tool call, alleviating reward sparsity in long-horizon tasks. Experimental results demonstrate that even state-of-the-art models such as GPT-5 achieve success rates below 70%, highlighting the difficulty of our benchmark. Notably, our fine-tuned lightweight 4B model consistently outperforms strong baselines in both preference capture and task performance, suggesting the effectiveness of our reward design.
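The contrast between a sparse outcome reward and per-tool-call process supervision can be illustrated with a minimal sketch. It is not the paper's reward design, which the abstract does not specify: the tool names, the weights, and the `ok` flag (standing in for some automatic, annotation-free check on each call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    name: str  # hypothetical tool name, e.g. "recall_preference"
    ok: bool   # hypothetical result of an automatic check on this call


# Hypothetical per-tool weights; the abstract gives no actual values.
TOOL_WEIGHTS = {"recall_preference": 0.5, "verify_attributes": 0.5}


def sparse_rewards(calls: List[ToolCall], task_success: bool) -> List[float]:
    """Outcome-only supervision: every step gets 0 except the last."""
    if not calls:
        return []
    rewards = [0.0] * len(calls)
    rewards[-1] = 1.0 if task_success else 0.0
    return rewards


def tool_wise_rewards(calls: List[ToolCall], task_success: bool) -> List[float]:
    """Process supervision: each tool call gets its own signal,
    and the task outcome is added to the final step."""
    rewards = []
    for call in calls:
        weight = TOOL_WEIGHTS.get(call.name, 0.0)  # unknown tools score 0
        rewards.append(weight if call.ok else -weight)
    if rewards:
        rewards[-1] += 1.0 if task_success else 0.0
    return rewards


episode = [
    ToolCall("recall_preference", ok=True),
    ToolCall("verify_attributes", ok=False),
    ToolCall("search", ok=True),
]
print(sparse_rewards(episode, task_success=True))
print(tool_wise_rewards(episode, task_success=True))
```

In this toy episode the sparse scheme gives no signal until the final step, while the tool-wise scheme penalizes the failed attribute verification at the step where it happens. That is the sense in which per-call rewards reduce reward sparsity in long-horizon tasks.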

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

    cs.AI 2026-05 conditional novelty 7.0

    ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of ag...