arxiv: 2605.19219 · v1 · pith:N75YY7A6new · submitted 2026-05-19 · 💻 cs.AI

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

Pith reviewed 2026-05-20 06:32 UTC · model grok-4.3

classification 💻 cs.AI
keywords e-commerce · A/B testing · simulation · VLM agents · persona generation · UI evaluation · browser agents · traffic grounding

The pith

Traffic-grounded VLM agents simulate e-commerce A/B test outcome shifts and achieve 77 percent directional alignment with real buyer behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Real A/B tests on e-commerce sites divert live traffic, often need weeks to reach reliable results, and expose users to unproven changes. SimGym builds a simulation by first deriving buyer personas from production clickstream records, then running vision-language model agents through live browser sessions on both the control and treatment storefronts. Each agent perceives the page visually and through browser structure, maintains episodic memory across clicks, and follows guardrails to complete shopping tasks. The framework measures how simulated add-to-cart rates shift between variants and compares those shifts against the shifts recorded in the original live experiments. Across multiple storefronts and product categories, the simulated shift direction matched the observed direction in 77 percent of cases, and the experimental cycle dropped from weeks to under an hour.
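The comparison described above can be sketched in a few lines of Python. This is a minimal illustration of one plausible reading of "directional alignment" (the fraction of tests in which the simulated and real add-to-cart shifts share a sign), not the paper's definition. The names `VariantCounts`, `shift`, and `directional_alignment` are hypothetical, all numbers are made up, and the choice to count two zero shifts as agreeing is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VariantCounts:
    """Sessions and add-to-cart (A2C) events for one arm of one test."""
    sessions: int
    add_to_cart: int

    @property
    def rate(self) -> float:
        return self.add_to_cart / self.sessions


def shift(control: VariantCounts, treatment: VariantCounts) -> float:
    """Absolute change in A2C rate from control to treatment."""
    return treatment.rate - control.rate


def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def directional_alignment(pairs) -> float:
    """Fraction of tests where simulated and real shifts share a sign.

    `pairs` is a list of (simulated_shift, real_shift) tuples.
    Assumption: a pair of exact zeros counts as agreement.
    """
    matches = sum(1 for sim, real in pairs if sign(sim) == sign(real))
    return matches / len(pairs)


# Hypothetical numbers for illustration only; none come from the paper.
sim_shift = shift(VariantCounts(200, 20), VariantCounts(200, 26))
real_shift = shift(VariantCounts(10_000, 900), VariantCounts(10_000, 950))
pairs = [(sim_shift, real_shift), (-0.01, 0.02), (0.04, 0.01), (-0.02, -0.03)]
alignment = directional_alignment(pairs)  # 3 of the 4 pairs agree in sign
```

Only the direction of each shift enters the score, so a simulated shift that is far larger or smaller than the real one still counts as a match when the signs agree.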

Core claim

SimGym demonstrates that VLM agents equipped with personas extracted from real clickstream data, multimodal browser observations, and episodic memory can generate outcome shifts that directionally track real buyer responses to visual UI theme changes, reaching 77 percent alignment on add-to-cart metrics while completing each simulated experiment in under an hour.

What carries the argument

Live-browser VLM agent architecture that fuses visual perception, browser-structured observations, episodic memory, and guardrails to run coherent shopping sessions across paired control and treatment storefronts.
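To make the roles of the four ingredients concrete, here is a minimal Python sketch of a session loop, assuming a simple step-based design. Everything in it is hypothetical: the `Agent` and `Observation` classes, the action vocabulary, the step cap, and the scripted stand-in for the VLM call are illustrative choices, not details taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Observation:
    screenshot: bytes   # stand-in for the rendered page image
    dom_summary: str    # stand-in for browser-structured observations


@dataclass
class Agent:
    persona: str
    # Stand-in for the VLM call: (persona, observation, memory) -> action string.
    policy: Callable[[str, Observation, List[str]], str]
    max_steps: int = 10  # guardrail: hard cap on session length
    allowed: tuple = ("click", "scroll", "add_to_cart", "stop")
    memory: List[str] = field(default_factory=list)  # episodic memory of past actions

    def run(self, observe: Callable[[], Observation]) -> bool:
        """Run one session; return True if it ended in add-to-cart."""
        for _ in range(self.max_steps):
            action = self.policy(self.persona, observe(), self.memory)
            if action.split(":")[0] not in self.allowed:
                action = "stop"  # guardrail: reject actions outside the vocabulary
            self.memory.append(action)
            if action == "add_to_cart":
                return True
            if action == "stop":
                return False
        return False


def scripted_policy(persona, obs, memory):
    """Toy policy that replays a fixed script, indexed by how much it remembers."""
    script = ["click:collection", "scroll", "click:product", "add_to_cart"]
    return script[len(memory)]


agent = Agent(persona="budget shopper looking for chairs", policy=scripted_policy)
converted = agent.run(lambda: Observation(screenshot=b"", dom_summary="<main>...</main>"))
```

Running the same personas against a control and a treatment storefront and averaging `converted` per arm would give the simulated add-to-cart rates that the evaluation compares.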

If this is right

  • Product teams can screen dozens of UI variants per day before committing live traffic.
  • Experiments become feasible on low-traffic storefronts or niche categories where statistical power is otherwise unattainable.
  • Risk of negative user experience during testing drops because no real buyers encounter the candidate variants.
  • Iteration speed increases, allowing more frequent updates to theme, layout, and visual elements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same persona-plus-browser-agent pattern could be applied to test search ranking or recommendation changes if the agents are given access to those signals.
  • Hybrid workflows become possible in which simulation filters out clearly inferior variants before any live traffic is used.
  • The 77 percent directional match suggests a practical threshold for deciding when a simulated result is reliable enough to act on without further live testing.

Load-bearing premise

That personas built from clickstream data plus VLM agents operating on visual and browser observations will produce shopping behavior that tracks how real buyers respond to UI theme changes.

What would settle it

A fresh set of live A/B tests on new storefronts in which the simulated add-to-cart shift directions disagree with the observed real shifts in more than 23 percent of variants.

Figures

Figures reproduced from arXiv: 2605.19219 by Aaron Glazer, Ailin Fan, Alberto Castelo, Andrew McNamara, Angelo Ocana Martins, Francis Pelland, Han Li, Jonathan Faerman, Keat Yang Koay, Lingyun Wang, Meysam Feghhi, Mingyu Zhao, Nikolas LeBlanc, Ronie Uliana, Shuang Xie, Vibhor Malik, Yuanzheng Zhu, Zahra Zanjani Foumani, Zhaoyu Zhang, Zhong Wu.

Figure 1: SimGym framework overview. … interfaces and execute multi-step tasks across diverse browser environments Zhou et al. [2023], Deng et al. [2023], Chezelles et al. [2024], Pan et al. [2024]. In parallel, persona and profile-conditioned agents have begun to exhibit more realistic patterns of user behavior Park et al. [2023], Wang et al. [2023], Zhang et al. [2024]. VLMs further strengthen this trend by enabling…
Figure 2: Dataset distribution of the 50-storefront golden set spanning 16 countries and 11 industries.
Figure 3: Human-agent agreement in A2C shifts. Each panel plots human-observed versus agent…
Figure 4: Agent sample-size sensitivity on the 50-shop golden set. Shaded bands denote the 10th–90th percentile range over 1000 bootstrap resamples. 5.3 Sensitivity Analysis. Since shop-level simulated A2C shifts are estimated from a finite set of buyer-agent sessions, the agent budget (i.e., the number of agents per shop simulation) determines both estimate variance and simulation cost. We therefore select the agent…
Figure 5: Buyer Archetype Construction Framework.
Figure 6: Persona Extracted Output
Figure 7: Agent Architecture. The Intent component, generated in Stage 3, pairs a sampled product target (here, “chairs”) with the fixed shopping guide described in Section 3, which anchors purchase decisions to the archetype. The Shopping Profile encodes two behavioral dimensions derived from Stage 4’s buyer behavior aggregation and Stage 5’s archetype construction: (1) Price Tier is set to "Budget" (price-consciou…
Figure 8: Agent Reasoning During Initial Navigation.
Figure 9: Screenshots of Agent Browsing. (a) Minis collection ($7–9). (b) Dragons collection…
Figure 10: Agent Reasoning During Collection Exploration.
Figure 11: Product Selection and Add-to-cart. (a) Premium Crystal Wing Dragon product page…
Figure 12: Agent Reasoning During Purchase Decision and Checkout.
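
The Figure 4 caption mentions 10th–90th percentile bands over 1000 bootstrap resamples as a function of the agent budget. The Python sketch below shows one way such a band could be computed for a single shop, assuming the resampled quantity is the simulated A2C shift and that sessions are drawn with replacement. The paper's exact procedure is not given in this excerpt, so the function names, the percentile indexing, and all data are hypothetical.

```python
import random


def a2c_shift(control, treatment):
    """Difference in add-to-cart rate; inputs are lists of 0/1 session outcomes."""
    return sum(treatment) / len(treatment) - sum(control) / len(control)


def bootstrap_band(control, treatment, n_agents, n_resamples=1000, seed=0):
    """10th-90th percentile range of the shift when only `n_agents` sessions
    per arm are drawn, with replacement, from the available sessions."""
    rng = random.Random(seed)
    shifts = sorted(
        a2c_shift(rng.choices(control, k=n_agents), rng.choices(treatment, k=n_agents))
        for _ in range(n_resamples)
    )
    lo = shifts[int(0.10 * (n_resamples - 1))]
    hi = shifts[int(0.90 * (n_resamples - 1))]
    return lo, hi


# Hypothetical session outcomes (1 = add-to-cart), for illustration only.
control = [1] * 30 + [0] * 270     # 10% A2C
treatment = [1] * 42 + [0] * 258   # 14% A2C
for n in (25, 100, 400):
    lo, hi = bootstrap_band(control, treatment, n_agents=n)
    print(n, round(lo, 3), round(hi, 3))
```

The band narrows as the number of agents per arm grows, which is the variance-versus-cost trade-off the sensitivity analysis is concerned with.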
Original abstract

A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We validate SimGym on A/B tests of visually driven UI theme changes from a major e-commerce platform across diverse storefronts and product categories. Empirical results show that SimGym agents achieve strong agreement with observed outcome shifts, attaining 77% directional alignment with add-to-cart shifts observed across interface variants in real-buyer traffic. It reduces experimental cycles from weeks to under an hour, enabling rapid experimentation without exposing real buyers to candidate variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. It consists of three components: a traffic-grounded persona generation pipeline deriving buyer archetypes from production clickstream data, a multimodal agent architecture combining visual and browser-structured observations with episodic memory and guardrails for coherent shopping sessions, and an evaluation protocol that compares simulated outcome shifts to real buyer behavior. Validation is performed on A/B tests of visually driven UI theme changes across diverse storefronts and categories from a major platform, with the central empirical result being 77% directional alignment on add-to-cart shifts.

Significance. If the reported alignment holds under detailed statistical validation, the work offers a practical advance for e-commerce experimentation by compressing multi-week A/B cycles into sub-hour simulations while avoiding real-user exposure. The traffic-grounded persona pipeline and live-browser multimodal setup represent a concrete step toward more realistic behavioral simulation than purely synthetic or rule-based alternatives, with potential applicability beyond e-commerce to other interface-testing domains.

major comments (2)
  1. [Evaluation protocol / §4] Evaluation protocol (described in abstract and §4): the central claim of 77% directional alignment with real add-to-cart shifts is presented without any reported sample sizes for either simulated or real traffic, statistical tests for significance, controls for multiple comparisons across variants, or handling of post-hoc exclusions. This information is required to assess whether the alignment exceeds chance and is load-bearing for the empirical validation.
  2. [Agent architecture / §3] Agent architecture and persona pipeline (abstract and §3): the assumption that VLM agents with visual perception and clickstream-derived personas will produce shifts comparable to real buyers for subtle UI theme changes (colors, spacing, imagery) is not accompanied by any explicit perceptual fidelity check or ablation isolating visual interpretation from guardrails or memory. Given that clickstream data primarily encodes action sequences rather than aesthetic decision factors, this is a load-bearing assumption for the 77% alignment result.
minor comments (2)
  1. [Introduction] The abstract and introduction would benefit from a brief comparison table or paragraph contrasting SimGym against prior simulation approaches (e.g., rule-based or purely LLM-based agents) to clarify the incremental contribution of the traffic-grounded VLM component.
  2. [Evaluation protocol] Notation for outcome metrics (e.g., directional alignment) should be defined explicitly with a formula or pseudocode in the evaluation section to avoid ambiguity in how 'directional' is computed across control/treatment pairs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have prepared revisions to improve the clarity and rigor of the empirical claims.

point-by-point responses
  1. Referee: [Evaluation protocol / §4] Evaluation protocol (described in abstract and §4): the central claim of 77% directional alignment with real add-to-cart shifts is presented without any reported sample sizes for either simulated or real traffic, statistical tests for significance, controls for multiple comparisons across variants, or handling of post-hoc exclusions. This information is required to assess whether the alignment exceeds chance and is load-bearing for the empirical validation.

    Authors: We agree that these statistical details are essential. In the revised manuscript we will expand §4 to report the number of real A/B tests evaluated, the number of simulated sessions per variant, the scale of the corresponding real traffic logs, and the results of a binomial sign test assessing whether the directional agreement rate significantly exceeds chance. We will also clarify that no post-hoc exclusions were performed and that the tests across independent storefronts do not require multiple-comparison correction. revision: yes

  2. Referee: [Agent architecture / §3] Agent architecture and persona pipeline (abstract and §3): the assumption that VLM agents with visual perception and clickstream-derived personas will produce shifts comparable to real buyers for subtle UI theme changes (colors, spacing, imagery) is not accompanied by any explicit perceptual fidelity check or ablation isolating visual interpretation from guardrails or memory. Given that clickstream data primarily encodes action sequences rather than aesthetic decision factors, this is a load-bearing assumption for the 77% alignment result.

    Authors: We agree that an explicit ablation would strengthen the presentation. Clickstream data is used only to derive high-level intents and session patterns; visual interpretation of subtle UI elements is performed by the VLM on live screenshots. The primary evidence for the assumption remains the empirical match to real A/B outcomes on visual theme changes. In revision we will add a discussion of the visual processing pipeline together with an ablation that replaces screenshots with textual page summaries, while noting that a dedicated human perceptual-fidelity study lies outside the present scope. revision: partial
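
The binomial sign test mentioned in the first response can be sketched with the Python standard library alone. This is an illustrative, one-sided exact test against a chance rate of 0.5, assuming independent tests and no zero shifts (ties). The function name and the counts are hypothetical, since the paper's sample sizes are not reported in this excerpt.

```python
from math import comb


def sign_test_p_value(agreements: int, n_tests: int) -> float:
    """One-sided exact binomial p-value.

    Returns P(X >= agreements) for X ~ Binomial(n_tests, 0.5), i.e. the
    probability of at least this many directional matches by coin flip.
    """
    tail = sum(comb(n_tests, k) for k in range(agreements, n_tests + 1))
    return tail / 2 ** n_tests


# Hypothetical counts for illustration: 10 matches out of 13 tests (about 77%).
p = sign_test_p_value(agreements=10, n_tests=13)
print(f"p = {p:.4f}")
```

The same agreement rate yields very different p-values at different sample sizes, which is why the referee's request for the number of evaluated tests matters.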

Circularity Check

0 steps flagged

No significant circularity; central result validated against external real-buyer outcomes

rationale

The paper's core claim is an empirical 77% directional alignment between simulated A/B outcome shifts and independently observed real-buyer traffic shifts on UI theme variants. This alignment metric is defined and measured against external production A/B test data rather than being fitted from or derived by construction from the SimGym parameters, personas, or VLM guardrails. Persona generation from clickstream data and the VLM agent architecture are presented as modeling choices whose fidelity is then tested externally; no self-definitional loop, fitted-input-as-prediction, or self-citation load-bearing step reduces the reported agreement to the inputs themselves. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the premise that clickstream-derived personas plus VLM perception suffice to simulate buyer responses; no free parameters or invented physical entities are mentioned, but the agent architecture itself is a constructed system.

axioms (1)
  • domain assumption: VLM agents with multimodal perception and episodic memory can maintain coherent shopping sessions across control and treatment storefronts
    Invoked in the description of the live-browser agent architecture
invented entities (1)
  • Traffic-grounded persona generation pipeline (no independent evidence)
    purpose: Derive per-shop buyer archetypes and intents from production clickstream data
    New pipeline component introduced to ground the agents

pith-pipeline@v0.9.0 · 5829 in / 1276 out tokens · 32081 ms · 2026-05-20T06:32:00.984534+00:00 · methodology



Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 6 internal anchors

  1. Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.

  2. Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, et al. Fara-7b: An efficient agentic model for computer use. arXiv preprint arXiv:2511.19663, 2025.

  3. Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, and Tat-Seng Chua. Large language models empowered personalized web agents. In Proceedings of the ACM on Web Conference 2025, pages 198–215.

  4. Sanxing Chen, Sam Wiseman, and Bhuwan Dhingra. Chatshop: Interactive information seeking with language agents. arXiv preprint arXiv:2404.09911.

  5. De Chezelles, Thibault Le Sellier, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F Xu, Siva Reddy, Quentin Cappart, et al. The browsergym ecosystem for web agent research. arXiv preprint arXiv:2412.05467.

  6. Google DeepMind. Gemini 3 flash - model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf, December 2025a. Model card. Published December 2025; updated 17 December

  7. Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents. arXiv preprint arXiv:2411.06559.

  8. Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856.

  9. Yijun Liu, Wu Liu, Xiaoyan Gu, Yong Rui, Xiaodong He, and Yongdong Zhang. Lmagent: A large-scale multimodal agents society for multi-user simulation. arXiv preprint arXiv:2412.09237.

  10. … Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. doi: 10.1145/3706599.3719729. Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuying Chen. Deepshop: A benchmark for deep research shopping agents. arXiv preprint arXiv:2506.02839.

  11. Saab Mansour, Leonardo Perelli, Lorenzo Mainetti, George Davidson, and Stefano D’Amato. Paars: Persona aligned agentic retail shoppers. In Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 143–159.

  12. Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373.

  13. Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, and Christian Bizer. Webmall–a multi-shop benchmark for evaluating web agents [technical report]. arXiv preprint arXiv:2508.13024.

  14. Tim Rieder, Marian Schneider, Mario Truss, Vitaly Tsaplin, Alina Rublea, Sinem Dere, Francisco Chicharro Sanz, Tobias Reiss, and Mustafa Doga Dogan. SimAB: Simulating A/B tests with persona-conditioned AI agents for rapid design evaluation. arXiv preprint arXiv:2603.01024.

  15. Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267.

  16. Lu Sun, Shihan Fu, Bingsheng Yao, Yuxuan Lu, Wenbo Li, Hansu Gu, Jiri Gesi, Jing Huang, Chen Luo, and Dakuo Wang. Llm agent meets agentic ai: Can llm agents simulate customers to evaluate agentic-ai-based shopping assistants? arXiv preprint arXiv:2509.21501.

  17. Huaixiao Tou, Ying Zeng, Cong Ma, Muzhi Li, Minghao Li, Weijie Yuan, He Zhang, and Kai Jia. Shoppingcomp: Are llms really ready for your shopping cart? arXiv preprint arXiv:2511.22978.

  18. Dakuo Wang, Ting-Yao Hsu, Yuxuan Lu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headean, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, et al. Agenta/b: Automated and scalable web a/btesting with interactive llm agents. arXiv preprint arXiv:2504.09723, 2025a. Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, and Xiaoyi Zeng. Shop-...

  19. Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, et al. Opera: A dataset of observation, persona, rationale, and action for evaluating llms on human online shopping behavior simulation. arXiv preprint arXiv:2506.05606, 2025c. Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, and Dakuo Wang. Cust...

  20. Yimeng Zhang, Jiri Gesi, Ran Xue, Tian Wang, Ziyi Wang, Yuxuan Lu, Sinong Zhan, Huimin Zeng, Qingjun Cui, Yufan Guo, et al. See, think, act: Online shopper behavior simulation with vlm agents. arXiv preprint arXiv:2510.19245, 2025a. Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, et ...