pith. machine review for the scientific record.

arxiv: 2605.14205 · v1 · submitted 2026-05-14 · 💻 cs.AI


SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents


Pith reviewed 2026-05-15 02:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords buyer personas · clickstreams · LLM agents · e-commerce · VQ-VAE · personalization · simulation

The pith

SimPersona learns discrete buyer types from clickstreams to let LLM agents simulate diverse real buyer populations in e-commerce.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SimPersona to make LLM web agents behave like diverse real buyers instead of collapsing to a single "average buyer" policy. It uses a behavior-aware VQ-VAE to learn discrete buyer types directly from raw clickstream data, capturing how different customers browse and buy across stores. Each type is tied to a dedicated persona token that conditions the LLM agent during fine-tuning on real browsing traces. For any merchant, a buyer population is simulated by sampling types from that merchant's empirical type distribution; the resulting agents reach 78% conversion-rate alignment with real buyers across 42 held-out storefronts, with no store-specific prompt engineering.

Core claim

By training a behavior-aware VQ-VAE on historical e-commerce clickstreams, SimPersona extracts a compact set of discrete buyer types that reflect the statistical structure of real buyer populations. Each type is assigned a dedicated persona token in the LLM agent's vocabulary, and fine-tuning on real browsing traces teaches type-specific navigation and purchase behavior. At inference, each synthetic buyer is mapped to a type, and hence to a token, with a single encoder forward pass; sampling types from a merchant's empirical distribution preserves that merchant's buyer mix. The reported result is 78% conversion-rate alignment with real buyers on 42 held-out live storefronts.

What carries the argument

The behavior-aware VQ-VAE inducing the discrete buyer-type codebook from clickstreams, along with the mapping of types to persona tokens for LLM conditioning.
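
A minimal sketch of how a buyer embedding could be snapped to a discrete type and then to a persona token. It assumes the usual VQ-VAE nearest-neighbour lookup under squared Euclidean distance; the toy codebook, the function names, and the `<persona_k>` token format are illustrative assumptions, not details taken from the paper.

```python
def assign_buyer_type(z, codebook):
    """Return the index of the codebook entry closest to encoder output z."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))


def persona_token(k):
    # Hypothetical token naming; the paper's actual token strings are not shown here.
    return f"<persona_{k}>"


# Toy codebook with K = 3 entries in a 2-dimensional latent space (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]

z = [0.9, 1.2]                      # stand-in for one buyer's encoder output
k = assign_buyer_type(z, codebook)  # nearest entry is index 1
token = persona_token(k)
```

Because assignment is a lookup against a fixed codebook, mapping a new buyer to a persona needs no gradient step, which is consistent with the "single encoder forward pass" described in the abstract.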

If this is right

  • Population-level simulations become possible by sampling buyer types according to each store's empirical distribution.
  • Agent assignment to a persona requires only one forward pass through the encoder with no retraining needed.
  • Goal-oriented shopping performance exceeds that of a baseline with eight times more parameters.
  • Distinct behavioral patterns emerge across the learned buyer types, making them interpretable from click data.
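
The first point above can be sketched as follows: estimate a store's empirical distribution over buyer types from the codes assigned to its historical buyers, then draw a synthetic population from it. The function names, the toy assignments, and the fixed seed are assumptions made for illustration only.

```python
import random
from collections import Counter


def empirical_distribution(assigned_types, num_types):
    """Fraction of a store's observed buyers assigned to each buyer type."""
    counts = Counter(assigned_types)
    total = len(assigned_types)
    return [counts.get(k, 0) / total for k in range(num_types)]


def sample_population(dist, n_agents, seed=0):
    """Draw buyer-type indices for n_agents simulated buyers."""
    rng = random.Random(seed)
    return rng.choices(range(len(dist)), weights=dist, k=n_agents)


# Toy history: buyer-type indices for ten real buyers of one store (made-up data).
observed = [0, 0, 1, 2, 2, 2, 2, 1, 0, 2]
dist = empirical_distribution(observed, num_types=3)
population = sample_population(dist, n_agents=1000)
```

Each sampled index would then select the persona token used to instantiate one agent, so the simulated population mirrors the store's observed mix rather than a uniform or hand-picked one.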

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Store designers could use these simulated populations to test layout changes before deployment.
  • Updating the codebook periodically with new clickstreams might allow the system to track shifts in buyer behavior over time.
  • The discrete representation could reduce computational costs when running large-scale agent evaluations compared to fully prompt-based personalization.

Load-bearing premise

The buyer types discovered from past clickstreams will accurately predict behavior in ongoing live interactions without substantial changes in customer preferences or platform features.

What would settle it

The claim would be undermined if agents using the learned personas produced conversion rates that deviate significantly from real buyer data on additional unseen storefronts, or if the types failed to differentiate between customers with measurably different purchase histories.

Figures

Figures reproduced from arXiv: 2605.14205 by Alberto Castelo, Han Li, Lingyun Wang, Shuang Xie, Ted Chaiwachirasak, Zahra Zanjani Foumani.

Figure 1: SIMPERSONA framework overview. Top-left: behavioral features and product embeddings are extracted from raw clickstreams. Top-right: a behavior-aware VQ-VAE maps each buyer to one of K persona tokens. Bottom-right: two-stage SFT grounds the tokens in the LLM; first token warm-up (backbone frozen), then full fine-tuning. Bottom-left: evaluation on unseen storefronts across behavioral alignment, conversion al…
Figure 2: Data pipeline overview. A single enrichment pass over raw clickstream logs produces …
Figure 3: Data enrichment. Raw event-level tables are joined with the product catalog, collection …
Figure 4: VQ-VAE input construction for a single buyer–shop pair.
Figure 5: SFT trace generation from enriched clickstreams.
Figure 6: Stratum distribution recovery across all …
Figure 7: Store-level behavioral reconstruction from persona token distributions. The codebook …
Figure 8: Per-shop error-rate comparison between two-stage and single-stage SFT (sorted by two …
Figure 9: Two-stage persona-grounding SFT examples. Each training example consists of a system …
Figure 10: Persona token ablation under neutral intents.
Original abstract

LLM-based web agents can navigate live storefronts, yet they often collapse to a single "average buyer" policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand-crafted prompt-based personas that are brittle, difficult to scale, context-inefficient, and unable to faithfully represent population-level behavior. We introduce SimPersona, a novel framework that learns discrete buyer types from historical traffic and exposes them to LLM-based web agents as compact persona tokens. Given raw clickstreams, a behavior-aware VQ-VAE induces a discrete buyer-type space that captures the statistical structure of real buyer behavior and merchant-specific buyer population distributions. To provide behavior-specific guidance to LLM-based web agents, SimPersona maps each learned buyer type to a dedicated persona token in the LLM agent vocabulary and fine-tunes the agent with these tokens on real browsing traces. At inference, each synthetic buyer is assigned to a learned buyer type with a single encoder forward pass, requiring no retraining or store-specific prompt engineering. For population-level simulation, SimPersona samples buyer types from each merchant's empirical distribution over the learned VQ-VAE codebook and instantiates agents with the corresponding persona tokens, preserving merchant-specific buyer population distributions. Evaluated on $8.37$M buyers across $42$ held-out live storefronts, SimPersona achieves $78\%$ conversion-rate alignment with real buyers, exhibits interpretable behavioral variation across buyer types, and outperforms a baseline with $8\times$ more parameters on goal-oriented shopping tasks. We further release an open-source data pipeline that converts raw e-commerce event logs into buyer representations and agent-training traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents SimPersona, a framework that trains a behavior-aware VQ-VAE on raw clickstreams to induce a discrete codebook of buyer types, maps each code to a dedicated persona token in an LLM agent's vocabulary, and fine-tunes the agent on the same traces. At inference, buyer types are assigned via a single encoder pass and agents are instantiated by sampling from each merchant's empirical distribution over the codebook, enabling population-level simulation without store-specific prompt engineering. On 8.37M buyers across 42 held-out live storefronts the method reports 78% conversion-rate alignment with real buyers, interpretable behavioral variation across types, and outperformance versus an 8× larger baseline on goal-oriented tasks; an open-source pipeline converting event logs to buyer representations is also released.

Significance. If the transfer from offline VQ-VAE codes to live LLM policies holds, the work supplies a scalable, data-driven alternative to hand-crafted personas for grounding e-commerce agents in real population distributions. The large-scale held-out evaluation and open-source pipeline are concrete strengths that would support reproducibility and further research in sequential behavior modeling.

major comments (3)
  1. [Results section] The 78% conversion-rate alignment is presented without a precise definition of the metric, the exact baseline architecture, or explicit controls for store-specific confounders (e.g., UI differences or traffic seasonality), making it difficult to assess whether the reported outperformance is robust.
  2. [Method and Evaluation] No quantitative comparison of simulated versus real session trajectories on the 42 held-out stores is reported (e.g., KL divergence or Wasserstein distance on next-action distributions conditioned on state and buyer type), so aggregate conversion alignment may mask per-type policy deviations under live site dynamics.
  3. [Inference procedure] The claim that a single encoder forward pass plus a persona token suffices for faithful transfer to new live interactions rests on the untested assumption that historical clickstream statistics remain representative under actual site-response feedback loops; this load-bearing transfer step lacks direct validation.
minor comments (2)
  1. [Abstract] Adding the VQ-VAE codebook size (number of discrete types) used in the main experiments would give readers immediate context for the scale of the learned persona space.
  2. [Pipeline release] The main text should include a short usage example or a pointer to the exact repository contents so that the open-source contribution can be reproduced immediately.
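
The trajectory-level check requested in major comment 2 could look roughly like the sketch below, which compares smoothed next-action distributions from real and simulated sessions with KL divergence. The action vocabulary, the additive smoothing constant, and the toy action lists are all assumptions for illustration; in practice the comparison would be repeated for each (state, buyer type) bucket.

```python
import math
from collections import Counter


def action_distribution(actions, vocab, eps=1e-6):
    """Additively smoothed empirical distribution over next actions."""
    counts = Counter(actions)
    total = len(actions) + eps * len(vocab)
    return {a: (counts.get(a, 0) + eps) / total for a in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p)


# Hypothetical action vocabulary and toy next-action samples for one bucket.
vocab = ["view_product", "add_to_cart", "checkout", "exit"]
real = ["view_product"] * 6 + ["add_to_cart"] * 2 + ["checkout", "exit"]
sim = ["view_product"] * 5 + ["add_to_cart"] * 3 + ["checkout", "exit"]

p = action_distribution(real, vocab)
q = action_distribution(sim, vocab)
kl = kl_divergence(p, q)  # 0 only when the two distributions coincide
```

Smoothing keeps the divergence finite when a simulated agent never takes an action that real buyers do; a per-bucket table of such values would expose per-type deviations that an aggregate conversion figure can hide.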

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. Below we provide detailed responses to each major comment, indicating the revisions made to address them.

  1. Referee: [Results section] The 78% conversion-rate alignment is presented without a precise definition of the metric, the exact baseline architecture, or explicit controls for store-specific confounders (e.g., UI differences or traffic seasonality), making it difficult to assess whether the reported outperformance is robust.

    Authors: We agree that greater clarity on the evaluation metric and controls is needed. In the revised manuscript, we have added the precise definition of the conversion-rate alignment metric as the percentage of stores where the absolute difference between simulated and real conversion rates is below 5%. We have also detailed the baseline architecture as an 8× larger LLM agent fine-tuned on the same traces without persona tokens, and incorporated explicit controls by aligning evaluation periods to account for seasonality and using the same storefront interfaces to mitigate UI confounders. revision: yes

  2. Referee: [Method and Evaluation] No quantitative comparison of simulated versus real session trajectories on the 42 held-out stores is reported (e.g., KL divergence or Wasserstein distance on next-action distributions conditioned on state and buyer type), so aggregate conversion alignment may mask per-type policy deviations under live site dynamics.

    Authors: We concur that trajectory-level distributional comparisons would strengthen the claims. Although the current evaluation focuses on conversion alignment and interpretable type variations, we have now computed and added Wasserstein distances on the next-action distributions (conditioned on state and buyer type) between simulated and real sessions across the 42 held-out stores in the revised Evaluation section. revision: yes

  3. Referee: [Inference procedure] The claim that a single encoder forward pass plus a persona token suffices for faithful transfer to new live interactions rests on the untested assumption that historical clickstream statistics remain representative under actual site-response feedback loops; this load-bearing transfer step lacks direct validation.

    Authors: The inference procedure is supported by the overall performance on live held-out interactions. However, we accept that the assumption regarding the representativeness of historical statistics under feedback loops lacks isolated direct validation. We have added a discussion of this assumption, including its potential limitations, to the revised manuscript. revision: partial
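
The metric definition given in the first simulated response (share of stores whose simulated and real conversion rates differ by less than 5%) can be sketched as below. This reads "5%" as an absolute gap of 0.05 in conversion rate, which is an assumption since the response does not say whether the threshold is absolute or relative; the store names and rates are made up.

```python
def conversion_rate_alignment(real_rates, sim_rates, tol=0.05):
    """Fraction of stores with |simulated - real| conversion rate below tol.

    tol=0.05 assumes an absolute threshold of five percentage points.
    """
    stores = real_rates.keys() & sim_rates.keys()
    if not stores:
        raise ValueError("no stores in common between real and simulated rates")
    aligned = sum(1 for s in stores if abs(real_rates[s] - sim_rates[s]) < tol)
    return aligned / len(stores)


# Toy per-store conversion rates (hypothetical values, not from the paper).
real = {"store_a": 0.030, "store_b": 0.120, "store_c": 0.080, "store_d": 0.200}
sim = {"store_a": 0.045, "store_b": 0.180, "store_c": 0.070, "store_d": 0.210}

score = conversion_rate_alignment(real, sim)  # store_b misses the tolerance
```

Under this reading the score is sensitive to the tolerance: with base conversion rates of a few percent, an absolute 0.05 band is wide, which is one reason the referee's request for a precise definition matters.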

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's core pipeline trains a behavior-aware VQ-VAE on historical clickstreams to induce discrete buyer-type codes, maps those codes to LLM persona tokens, and fine-tunes the agent on the same traces before evaluating conversion-rate alignment on 42 held-out live storefronts. This is a standard empirical training-plus-held-out-evaluation workflow; the reported 78% alignment is measured against external real-buyer distributions rather than being a quantity that equals its own fitted inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The derivation therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework depends on the VQ-VAE successfully capturing statistically meaningful buyer clusters from clickstreams and on those clusters remaining useful when injected as tokens into an LLM.

free parameters (1)
  • VQ-VAE codebook size (number of discrete buyer types)
    Hyperparameter that determines how many distinct personas are induced; its value is chosen to balance coverage and interpretability.
axioms (1)
  • Domain assumption: behavior embeddings from clickstreams can be discretized into a finite codebook that preserves population-level statistical structure
    Invoked when the VQ-VAE is trained on raw traffic to produce merchant-specific distributions.

pith-pipeline@v0.9.0 · 5622 in / 1248 out tokens · 39175 ms · 2026-05-15T02:52:59.932218+00:00 · methodology


    … over encoder outputs from a full pass through the training set. During training, entries are updated via exponential moving averages rather than gradient descent:

    $$e_k \leftarrow \gamma\, e_k + (1-\gamma)\,\bar{z}_k, \tag{7}$$

    where $\bar{z}_k$ is the mean of encoder outputs assigned to entry $k$ in the current batch and $\gamma \in [0,1)$ controls the memory of past assignments. To prevent codebook collapse Razav...
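
A minimal sketch of the exponential-moving-average codebook update in the fragment above, written in plain Python for a single batch. Leaving entries with no assigned outputs unchanged is an assumption made here, as are the function name and the toy values; the paper's handling of unused entries is cut off in the extracted text.

```python
def ema_codebook_update(codebook, encoder_outputs, assignments, gamma=0.99):
    """Apply e_k <- gamma * e_k + (1 - gamma) * mean(z assigned to k)."""
    dim = len(codebook[0])
    sums, counts = {}, {}
    for z, k in zip(encoder_outputs, assignments):
        acc = sums.setdefault(k, [0.0] * dim)
        for i, v in enumerate(z):
            acc[i] += v
        counts[k] = counts.get(k, 0) + 1

    updated = [list(e) for e in codebook]
    for k, acc in sums.items():
        batch_mean = [v / counts[k] for v in acc]
        updated[k] = [gamma * e + (1 - gamma) * m
                      for e, m in zip(codebook[k], batch_mean)]
    # Assumption: entries that received no assignments in this batch stay as they were.
    return updated


# Toy batch: both encoder outputs are assigned to entry 0 (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0]]
outputs = [[2.0, 2.0], [4.0, 4.0]]
new_codebook = ema_codebook_update(codebook, outputs, [0, 0], gamma=0.5)
```

With gamma close to 1 each entry moves slowly toward the recent batch means; with gamma equal to 0 it would jump to the current batch mean outright.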