ChatShop: Interactive information seeking with language agents. arXiv preprint arXiv:2404.09911
3 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Verdicts: UNVERDICTED 3
Roles: background 1
Polarities: background 1
citing papers explorer
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
  τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
- SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents
  SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.
- A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains
  The paper proposes Amazon-Bench, a functionality-grounded benchmark for web agents in e-commerce that generates diverse task queries from webpage elements and evaluates both task performance and safety risks.