SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
Pith reviewed 2026-05-19 17:15 UTC · model grok-4.3
The pith
SimPersona learns discrete buyer types from raw clickstreams and maps them to tokens that guide LLM agents to simulate varied real buyers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A behavior-aware VQ-VAE compresses raw clickstreams into a discrete codebook of buyer types that captures both universal shopping patterns and the specific customer mix at each merchant. Each code is mapped to a persona token inserted into the LLM vocabulary; the agent is then fine-tuned on real browsing traces so that the token steers its actions toward the corresponding type. At inference a single forward pass through the encoder selects the type for any new buyer, and population-level rollouts sample types from the merchant's empirical distribution over the codebook to reproduce observed heterogeneity without per-store prompt engineering.
What carries the argument
Behavior-aware VQ-VAE that turns clickstream sequences into discrete buyer-type codes later mapped to dedicated persona tokens for LLM guidance.
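The quantization step can be illustrated with a minimal sketch, assuming the usual VQ-VAE nearest-neighbour lookup over the codebook. The helper `assign_buyer_type`, the toy two-dimensional codebook, and the embedding values are hypothetical and not taken from the paper; the behavior-aware encoder that would produce the embedding is not shown.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_buyer_type(z: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to the encoder output z."""
    return min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))


# Toy codebook with three buyer types (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]

# A single encoder output is mapped to its nearest code in one lookup.
buyer_type = assign_buyer_type([0.9, 1.2], codebook)
print(buyer_type)
```

The returned index is the discrete buyer type; in the framework described here it would then select the matching persona token.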
If this is right
- Simulated buyers reach 78 percent conversion-rate alignment with real buyers across 42 held-out live stores.
- Distinct buyer types produce interpretable and varied behavioral patterns in shopping sessions.
- The method outperforms a baseline agent that has eight times more parameters on goal-oriented tasks.
- Merchant-specific population distributions are preserved when sampling buyer types for large-scale simulations.
- An open data pipeline converts raw event logs into buyer representations and training traces.
Where Pith is reading between the lines
- The same discrete types could serve as lightweight conditioning signals for testing how store layout changes affect different customer segments.
- Extending the codes to capture session-level state changes might allow agents to model evolving intent within a single visit.
- The persona tokens could transfer to other web-agent domains such as content recommendation or support chat to add population-level realism.
Load-bearing premise
The discrete codes learned from historical clickstreams represent stable buyer types that transfer to new stores and give the LLM effective non-overfitting guidance during fine-tuning and inference.
What would settle it
Running SimPersona agents on additional held-out storefronts and measuring a large gap between their simulated conversion rates and the actual rates recorded by real buyers on those stores.
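A check of this kind could be sketched as follows. The absolute per-store gap used here is an assumed comparison, not the paper's alignment metric (whose definition is not given on this page), and the store names and session outcomes are made-up toy values.

```python
def conversion_rate(purchased: list[bool]) -> float:
    """Fraction of sessions that ended in a purchase."""
    return sum(purchased) / len(purchased)


def rate_gaps(real: dict[str, list[bool]], simulated: dict[str, list[bool]]) -> dict[str, float]:
    """Absolute conversion-rate gap per store (an assumed comparison, see lead-in)."""
    return {
        store: abs(conversion_rate(real[store]) - conversion_rate(simulated[store]))
        for store in real
    }


# Toy session outcomes, True meaning the session converted (hypothetical data).
real = {
    "store_a": [True, False, False, False],
    "store_b": [True, True, False, False],
}
simulated = {
    "store_a": [True, False, False, False],
    "store_b": [True, False, False, False],
}

gaps = rate_gaps(real, simulated)
print(gaps)
```

Consistently large gaps on storefronts outside the original evaluation set would count against the transfer claim.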
Original abstract
LLM-based web agents can navigate live storefronts, yet they often collapse to a single "average buyer" policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand-crafted prompt-based personas that are brittle, difficult to scale, context-inefficient, and unable to faithfully represent population-level behavior. We introduce SimPersona, a novel framework that learns discrete buyer types from historical traffic and exposes them to LLM-based web agents as compact persona tokens. Given raw clickstreams, a behavior-aware VQ-VAE induces a discrete buyer-type space that captures the statistical structure of real buyer behavior and merchant-specific buyer population distributions. To provide behavior-specific guidance to LLM-based web agents, SimPersona maps each learned buyer type to a dedicated persona token in the LLM agent vocabulary and fine-tunes the agent with these tokens on real browsing traces. At inference, each synthetic buyer is assigned to a learned buyer type with a single encoder forward pass, requiring no retraining or store-specific prompt engineering. For population-level simulation, SimPersona samples buyer types from each merchant's empirical distribution over the learned VQ-VAE codebook and instantiates agents with the corresponding persona tokens, preserving merchant-specific buyer population distributions. Evaluated on $8.37$M buyers across $42$ held-out live storefronts, SimPersona achieves $78\%$ conversion-rate alignment with real buyers, exhibits interpretable behavioral variation across buyer types, and outperforms a baseline with $8\times$ more parameters on goal-oriented shopping tasks. We further release an open-source data pipeline that converts raw e-commerce event logs into buyer representations and agent-training traces.
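The population-level sampling step described in the abstract can be sketched as below. This is an illustration under stated assumptions: the function names, the toy list of historical code assignments, and the fixed seed are hypothetical, and the paper's own pipeline may estimate or smooth the distribution differently.

```python
import random
from collections import Counter


def empirical_distribution(codes: list[int]) -> dict[int, float]:
    """Relative frequency of each buyer-type code among a merchant's historical buyers."""
    counts = Counter(codes)
    total = len(codes)
    return {code: n / total for code, n in counts.items()}


def sample_population(dist: dict[int, float], n_agents: int, rng: random.Random) -> list[int]:
    """Draw buyer types for n_agents simulated buyers according to dist."""
    codes = sorted(dist)
    weights = [dist[c] for c in codes]
    return rng.choices(codes, weights=weights, k=n_agents)


# Hypothetical code assignments for one merchant's historical buyers.
historical_codes = [0, 0, 0, 2, 2, 5]
dist = empirical_distribution(historical_codes)

# Seeded generator so the rollout population is reproducible.
population = sample_population(dist, n_agents=1000, rng=random.Random(0))
print(Counter(population))
```

Each sampled code would then be turned into its persona token before the agent is instantiated, so the simulated population follows the merchant's observed mix of buyer types.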
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SimPersona, a framework that learns discrete buyer types from raw e-commerce clickstreams via a behavior-aware VQ-VAE, maps these types to compact persona tokens in an LLM agent's vocabulary, fine-tunes the agent on real browsing traces, and at inference assigns synthetic buyers to types via a single encoder pass. For population simulation it samples from each merchant's empirical distribution over the learned codebook. The central empirical claim is that, when evaluated on 8.37M buyers across 42 held-out live storefronts, the resulting agents achieve 78% conversion-rate alignment with real buyers, display interpretable behavioral variation, and outperform an 8× larger baseline on goal-oriented shopping tasks. An open-source data pipeline converting event logs to buyer representations is also released.
Significance. If the generalization claims hold, the work offers a scalable, non-hand-crafted alternative to prompt-based personas for grounding LLM web agents in heterogeneous buyer populations. The combination of a learned discrete codebook with token-level fine-tuning and merchant-specific distribution sampling could materially improve simulation fidelity for e-commerce applications while remaining parameter-efficient. The released data pipeline is a concrete positive contribution that lowers the barrier for follow-on research.
major comments (3)
- [Abstract and §4] Abstract and §4 (evaluation protocol): the 78% conversion-rate alignment and transfer claims rest on the assumption that the VQ-VAE codebook and empirical distributions were learned from a merchant-disjoint training set. The manuscript must explicitly state the merchant split used for VQ-VAE training versus the 42 held-out storefronts; without this, the alignment metric risks reflecting merchant-specific memorization rather than merchant-agnostic buyer-type generalization.
- [§3.2 and §5.1] §3.2 and §5.1: the behavior-aware VQ-VAE is described as capturing both statistical structure and merchant-specific distributions, yet no ablation or sensitivity analysis is reported on codebook size, commitment loss weight, or encoder architecture. These are free parameters that directly affect the induced buyer-type space; their impact on downstream alignment and interpretability should be quantified.
- [Table 2 / §5.2] Table 2 / §5.2: the reported outperformance versus the 8× larger baseline lacks error bars, statistical significance tests, and a precise definition of the goal-oriented shopping task success metric. Without these, it is difficult to assess whether the persona-token guidance is the load-bearing factor or whether other training differences explain the gap.
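For orientation, the commitment loss weight mentioned in the second comment is the coefficient $\beta$ in the standard VQ-VAE objective of van den Oord et al. [22]. The form below is that generic objective, not the paper's behavior-aware variant, which may include further terms not described on this page; $z_e(x)$ denotes the encoder output, $e_k$ the selected codebook entry, and $\operatorname{sg}[\cdot]$ the stop-gradient operator.

```latex
\mathcal{L}
  = \underbrace{-\log p\bigl(x \mid z_q(x)\bigr)}_{\text{reconstruction}}
  + \underbrace{\bigl\lVert \operatorname{sg}[z_e(x)] - e_k \bigr\rVert_2^2}_{\text{codebook}}
  + \beta\,\underbrace{\bigl\lVert z_e(x) - \operatorname{sg}[e_k] \bigr\rVert_2^2}_{\text{commitment}},
\qquad
k = \arg\min_j \bigl\lVert z_e(x) - e_j \bigr\rVert_2 .
```

When the codebook is updated by exponential moving averages instead of gradients, the codebook term is typically dropped and only the reconstruction and commitment terms are optimized, which is why $\beta$ and the codebook size are the natural knobs for a sensitivity study.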
minor comments (2)
- [§3.3] Notation: the mapping from VQ-VAE code indices to LLM persona tokens should be given an explicit equation or algorithm box for reproducibility.
- [Figure 3] Figure 3 (behavioral variation): axis labels and legend entries are too small for print; increase font size and add a short caption explaining how the plotted trajectories were generated.
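The explicit mapping requested in the first minor comment could look like the sketch below. The token format `<persona_k>`, the helper names, and the codebook size of 64 are assumptions made for illustration; the paper's actual token naming is not given on this page.

```python
def persona_token(code: int) -> str:
    """Hypothetical naming scheme: one dedicated token string per codebook index."""
    return f"<persona_{code}>"


def build_token_table(codebook_size: int) -> dict[int, str]:
    """Bijective map from code index k to its persona token."""
    return {k: persona_token(k) for k in range(codebook_size)}


def build_prompt(code: int, task: str, table: dict[int, str]) -> str:
    """Prefix the task description with the persona token for the given buyer type."""
    return f"{table[code]} {task}"


table = build_token_table(64)  # 64 is an assumed codebook size
prompt = build_prompt(17, "Find a pair of running shoes.", table)
print(prompt)
```

In a Hugging Face Transformers setup, such strings would typically be registered with `tokenizer.add_tokens` followed by `model.resize_token_embeddings`; whether the authors do this is not stated here.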
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the generalization claims and strengthen the empirical analysis. We address each major point below and indicate the revisions we will make.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (evaluation protocol): the 78% conversion-rate alignment and transfer claims rest on the assumption that the VQ-VAE codebook and empirical distributions were learned from a merchant-disjoint training set. The manuscript must explicitly state the merchant split used for VQ-VAE training versus the 42 held-out storefronts; without this, the alignment metric risks reflecting merchant-specific memorization rather than merchant-agnostic buyer-type generalization.
Authors: We agree that explicit clarification is necessary. The VQ-VAE codebook was trained on clickstreams from a merchant-disjoint collection of 87 storefronts, with the 42 evaluation storefronts held out entirely (no overlap in merchants or sessions). We will add this detail to the abstract, §4 (evaluation protocol), and a new paragraph in §3.2 describing the data splits. This ensures the reported 78% alignment measures cross-merchant generalization.
Revision: yes
- Referee: [§3.2 and §5.1] §3.2 and §5.1: the behavior-aware VQ-VAE is described as capturing both statistical structure and merchant-specific distributions, yet no ablation or sensitivity analysis is reported on codebook size, commitment loss weight, or encoder architecture. These are free parameters that directly affect the induced buyer-type space; their impact on downstream alignment and interpretability should be quantified.
Authors: We acknowledge the value of such analysis. In the revision we will add a sensitivity study in §5.1 (and an accompanying table) varying codebook size (K=32, 64, 128, 256), commitment loss coefficient (0.1–1.0), and encoder depth, reporting effects on conversion-rate alignment, codebook utilization, and qualitative interpretability of the resulting buyer types. This will be computed on a fixed validation split to avoid additional compute overhead.
Revision: yes
- Referee: [Table 2 / §5.2] Table 2 / §5.2: the reported outperformance versus the 8× larger baseline lacks error bars, statistical significance tests, and a precise definition of the goal-oriented shopping task success metric. Without these, it is difficult to assess whether the persona-token guidance is the load-bearing factor or whether other training differences explain the gap.
Authors: The success metric is the fraction of episodes in which the agent completes a purchase of the target item within a 20-step budget; this definition appears in §5.2 but will be restated more precisely. We will augment Table 2 with standard-deviation error bars computed over 5 independent fine-tuning seeds and add paired t-test p-values comparing SimPersona against the baseline. These additions will be included in the revised §5.2 and Table 2 caption.
Revision: yes
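The success metric and the proposed statistics from the last response can be sketched as follows. This is a minimal illustration, assuming each episode is recorded as a (purchased target, steps used) pair and that the two systems are paired by fine-tuning seed; every number below is a made-up toy value, not a result from the paper.

```python
import math
import statistics

STEP_BUDGET = 20  # step budget stated in the rebuttal


def success_rate(episodes: list[tuple[bool, int]]) -> float:
    """Fraction of episodes that purchase the target item within the step budget."""
    hits = sum(1 for purchased, steps in episodes if purchased and steps <= STEP_BUDGET)
    return hits / len(episodes)


def paired_t_statistic(a: list[float], b: list[float]) -> tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Toy episodes: the second one exceeds the budget, the third never purchases.
episodes = [(True, 12), (True, 25), (False, 20), (True, 20)]
rate = success_rate(episodes)

# Toy per-seed success rates for 5 seeds (hypothetical, for illustration only).
model_scores = [0.62, 0.60, 0.65, 0.61, 0.63]
baseline_scores = [0.55, 0.56, 0.58, 0.54, 0.57]

error_bar = statistics.stdev(model_scores)
t_stat, dof = paired_t_statistic(model_scores, baseline_scores)
print(rate, error_bar, t_stat, dof)
```

A p-value would follow from the t distribution with `dof` degrees of freedom, for example via `scipy.stats.ttest_rel`; with only 5 seeds the test has limited power, so the error bars are worth reading alongside it.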
Circularity Check
No significant circularity detected; derivation is self-contained
full rationale
The paper trains a behavior-aware VQ-VAE on historical clickstreams to induce discrete buyer-type codes and merchant-specific distributions, then maps codes to persona tokens for fine-tuning LLM agents and evaluates conversion-rate alignment on 42 explicitly held-out live storefronts. The hold-out of storefronts separates the VQ-VAE training data from the evaluation merchants, so the reported 78% alignment and outperformance are measured against independent real-buyer traces rather than reducing to the fitted inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The framework remains empirically testable and does not collapse to tautology.
Axiom & Free-Parameter Ledger
free parameters (2)
- VQ-VAE codebook size (number of discrete buyer types)
- VQ-VAE training hyperparameters (e.g., commitment loss weight, encoder architecture)
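A sweep over these free parameters could be enumerated as in the sketch below. The codebook sizes are the ones named in the authors' rebuttal; the specific commitment-weight points inside the stated 0.1 to 1.0 range and the encoder depths are assumptions chosen for illustration.

```python
from itertools import product

codebook_sizes = [32, 64, 128, 256]         # values named in the rebuttal
commitment_weights = [0.1, 0.25, 0.5, 1.0]  # assumed points within the stated 0.1-1.0 range
encoder_depths = [2, 4]                     # hypothetical depths, not given in the source

# One configuration per combination of the three free parameters.
grid = [
    {"codebook_size": k, "commitment_weight": beta, "encoder_depth": depth}
    for k, beta, depth in product(codebook_sizes, commitment_weights, encoder_depths)
]

print(len(grid))
print(grid[0])
```

Each configuration would be trained and scored on the same fixed validation split so that differences in alignment and codebook utilization can be attributed to the parameter being varied.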
axioms (2)
- domain assumption: Raw clickstream sequences contain sufficient statistical structure to induce meaningful discrete buyer types that generalize across merchants.
- domain assumption: Mapping learned types to dedicated persona tokens in the LLM vocabulary allows effective behavior-specific guidance without retraining the base model.
invented entities (2)
- discrete buyer-type space induced by behavior-aware VQ-VAE: no independent evidence
- persona tokens in LLM agent vocabulary: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "a behavior-aware vector-quantized variational autoencoder (VQ-VAE) induces a discrete buyer-type space that captures the statistical structure of real buyer behavior"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "two-stage persona-grounding procedure that decouples learning what each token means from learning how to act on it"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
- [2] Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster analysis. Communications in Statistics – Theory and Methods, 3(1):1–27, 1974.
- [3] Yun-Shiuan Chuang, Krirk Nirunwiroj, Zach Studdiford, Agam Goyal, Vincent V Frigo, Sijia Yang, Dhavan V Shah, Junjie Hu, and Timothy T Rogers. Beyond demographics: Aligning role-playing llm-based agents using human belief networks. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14010–14026, 2024.
- [4] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, 2nd edition, 1988.
- [5] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [6] Ronald A. Fisher. The Design of Experiments. Oliver and Boyd, 1935.
- [7] Simret Araya Gebreegziabher, Yukun Yang, Charles Chiang, Hojun Yoo, Chaoran Chen, Hyo Jin Do, Zahra Ashktorab, Werner Geyer, Diego Gómez-Zará, and Toby Jia-Jun Li. The behavioral fabric of llm-powered gui agents: Human values and interaction outcomes. In Proceedings of the 31st International Conference on Intelligent User Interfaces, pages 909–927, 2026.
- [8] Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023.
- [9] Tobias Hatt and Stefan Feuerriegel. Detecting user exits from online behavior: A duration-dependent latent state model. arXiv preprint arXiv:2208.03937, 2022.
- [10] William H. Kruskal and W. Allen Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621, 1952.
- [11] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
- [12] Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Zheshen Wang, Qi He, and Dakuo Wang. Can llm agents simulate multi-turn human behavior? evidence from real online customer behavior data. arXiv preprint arXiv:2503.20749, 2025.
- [13] Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Zheshen Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, and Dakuo Wang. Uxagent: An llm agent-based usability testing framework for web design. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1–12, 2025.
- [14] Sunnie S. Y. Lutz et al. The prompt makes the person(a): A systematic evaluation of sociodemographic persona prompting for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.
- [15] Jianmo Ni et al. Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018.
- [16] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.
- [17] Joon Sung Park et al. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109, 2024. Work-page title: LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals.
- [18] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, 2019.
- [19] Yunfan Shao et al. Character-llm: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- [20] Yimin Shi, Yang Fei, Shiqi Zhang, Haixun Wang, and Xiaokui Xiao. You are what you bought: Generating customer personas for e-commerce applications. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1810–1819, 2025.
- [21] Yunxiao Shi, Wujiang Xu, Zeqi Zhang, Xing Zi, Qiang Wu, and Min Xu. Personax: A recommendation agent-oriented user modeling framework for long behavior sequence. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5764–5787, Vienna, Austria, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-acl.300.
- [22] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.
- [23] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- [24] Dakuo Wang, Ting-Yao Hsu, Yuxuan Lu, Limeng Cui, Yaochen Xie, William Headean, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, and Jessie Wang. Agenta/b: Automated and scalable web a/b testing with interactive llm agents. arXiv preprint arXiv:2504.09723, 2025.
- [25] Gang Wang, Xinyi Zhang, Shiliang Tang, Haitao Zheng, and Ben Y. Zhao. Unsupervised clickstream clustering for user behavior analysis. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 225–236. ACM, 2016. doi: 10.1145/2858036.2858107.
- [26] Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, and Dakuo Wang. Opera: A dataset of observation, persona, rationale, and action for evaluating llms on human online shopping behavior simulation. arXiv preprint arXiv:2506.05606... (doi:10.48550/arxiv.2506.05606, 2025)
- [27] Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, and Dakuo Wang. Customer-r1: Personalized simulation of human behaviors via rl-based llm agent in online shopping. arXiv preprint arXiv:2510.07230, 2025.
- [28] B. L. Welch. The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika, 34(1/2):28–35, 1947.
- [29] An Yang, Baosong Yang, Beichen Zhang, Binyuan Wang, Bo Li, Bowen Liu, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [30] Dale Yang et al. TRACE: Transformer-based user representations from attributed clickstream event sequences. In Proceedings of the ACM Web Conference, 2023.
- [31] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, volume 35, 2022.
- [32] Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, et al. Shop-r1: Rewarding llms to simulate human behavior in online shopping via reinforcement learning. arXiv preprint arXiv:2507.17842, 2025.
- [33] Wen Zheng et al. A deep Markov model for clickstream analytics in online shopping. In Proceedings of The Web Conference 2020, 2020.
- [34] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. Webarena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024.
A Data Pipeline. Figure 2 illustrates our end-to-end data pipeline described in Section 2...
- [35] you are interested in product X
over encoder outputs from a full pass through the training set. During training, entries are updated via exponential moving averages rather than gradient descent: $e_k \leftarrow \gamma e_k + (1-\gamma)\,\bar{z}_k$ (7), where $\bar{z}_k$ is the mean of encoder outputs assigned to entry $k$ in the current batch and $\gamma \in [0,1)$ controls the memory of past assignments. To prevent codebook collapse Razav...
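The exponential-moving-average update in Eq. (7) can be sketched in plain Python as below. This is an illustration under stated assumptions: the function name, the toy vectors, and the choice to leave entries with no assigned outputs unchanged are hypothetical, and the collapse-prevention step mentioned in the truncated text above is not modelled.

```python
def ema_update(
    codebook: list[list[float]],
    encoder_outputs: list[list[float]],
    assignments: list[int],
    gamma: float,
) -> list[list[float]]:
    """Apply e_k <- gamma * e_k + (1 - gamma) * z_bar_k to every assigned entry."""
    dim = len(codebook[0])
    updated = [list(e) for e in codebook]
    for k, e in enumerate(codebook):
        assigned = [z for z, a in zip(encoder_outputs, assignments) if a == k]
        if not assigned:
            continue  # assumption: entries unused in this batch keep their value
        # z_bar_k: mean of the encoder outputs assigned to entry k in this batch.
        z_bar = [sum(z[d] for z in assigned) / len(assigned) for d in range(dim)]
        updated[k] = [gamma * e[d] + (1 - gamma) * z_bar[d] for d in range(dim)]
    return updated


# Toy batch: both encoder outputs are assigned to entry 0 (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0]]
outputs = [[2.0, 0.0], [4.0, 0.0]]
assignments = [0, 0]

new_codebook = ema_update(codebook, outputs, assignments, gamma=0.5)
print(new_codebook)
```

A value of `gamma` close to 1 makes each entry move slowly toward the batch mean, which matches the description of $\gamma$ as controlling the memory of past assignments.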