Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data
Pith reviewed 2026-05-22 22:10 UTC · model grok-4.3
The pith
Prompt-based LLMs generate real human shopping actions with only 11.86 percent accuracy across 31,865 sessions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prompt-based LLMs achieve only 11.86 percent accuracy in generating human actions that match recorded behavior in real online shopping sessions, while fine-tuning on real human click-through data augmented with synthesized reasoning traces improves the fine-tuned Qwen2.5-7B model to 17.26 percent action generation accuracy and 33.86 percent F1 score on final purchase prediction.
What carries the argument
The accuracy metric that counts exact matches between model-generated actions and the 230,965 recorded user actions across 31,865 shopping sessions.
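A minimal sketch of how an exact-match metric of this kind could be computed. It assumes each action is a (kind, target) pair, in line with the definition given in the simulated rebuttal further down. The `Action` record, the `exact_match_accuracy` function, and the toy data are illustrative and are not taken from the paper.

```python
from typing import NamedTuple, Sequence


class Action(NamedTuple):
    """Hypothetical action record: the action kind plus what it acts on."""
    kind: str    # e.g. "search", "click", "purchase"
    target: str  # e.g. a search term or a product ID


def exact_match_accuracy(
    predicted: Sequence[Action], logged: Sequence[Action]
) -> float:
    """Fraction of positions where the prediction equals the logged action."""
    if len(predicted) != len(logged):
        raise ValueError("need one prediction per logged action")
    if not logged:
        raise ValueError("no actions to score")
    hits = sum(p == g for p, g in zip(predicted, logged))
    return hits / len(logged)


# Toy data, invented for illustration only.
logged = [
    Action("search", "running shoes"),
    Action("click", "B001"),
    Action("purchase", "B001"),
]
predicted = [
    Action("search", "running shoes"),  # matches
    Action("click", "B002"),            # right kind, wrong target: a miss
    Action("purchase", "B001"),         # matches
]
accuracy = exact_match_accuracy(predicted, logged)
print(f"{accuracy:.2%}")
```

Under this strict definition a prediction with the right action kind but the wrong product counts as a miss, which is one reason exact-match scores can sit far below believability ratings.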
If this is right
- Fine-tuning on real click data with added reasoning traces produces measurable gains in both action accuracy and purchase prediction.
- The improvements of 5.4 and 13.85 percentage points over prompt baselines show that task-specific training data can narrow the simulation gap.
- The 31,865-session dataset supplies the first large-scale quantitative benchmark for multi-turn human behavior simulation.
- Downstream applications that rely on accurate step-by-step human modeling will need methods beyond zero-shot prompting.
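The reported gains can be checked with simple arithmetic. The sketch below uses only the figures quoted in the abstract. It assumes the gains are absolute differences, which holds for accuracy (17.26 minus 11.86 is 5.40). The implied prompt-only F1 baseline is derived under that assumption and is not a number the abstract states.

```python
# Figures quoted in the abstract, all in percent.
prompt_accuracy = 11.86
finetuned_accuracy = 17.26
finetuned_f1 = 33.86
reported_f1_gain = 13.85

# Absolute accuracy gain: this reproduces the reported 5.4, so the
# abstract's "5.4%" is a difference in percentage points.
accuracy_gain_points = round(finetuned_accuracy - prompt_accuracy, 2)

# The same gain expressed relative to the prompt-only baseline.
accuracy_gain_relative = round(100 * accuracy_gain_points / prompt_accuracy, 1)

# Assumption: the F1 gain is also absolute, which implies this baseline.
implied_baseline_f1 = round(finetuned_f1 - reported_f1_gain, 2)

print(accuracy_gain_points, accuracy_gain_relative, implied_baseline_f1)
```

Read as a relative change, the accuracy gain is roughly 45 percent of the baseline, which is why the distinction between percent and percentage points matters when comparing these results with other work.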
Where Pith is reading between the lines
- Shopping may expose weaknesses in current LLMs that other multi-turn tasks such as dialogue or planning do not reveal as clearly.
- Models that reach higher action accuracy could enable more reliable synthetic user testing for e-commerce interfaces.
- The remaining gap after fine-tuning suggests that richer state representations or explicit memory mechanisms may still be required.
Load-bearing premise
The logged actions in the collected shopping sessions form a complete and unbiased record of what users would have done under the same conditions.
What would settle it
A replication experiment on the same 31,865 sessions in which any prompt-only LLM exceeds 30 percent exact-match action accuracy would falsify the reported gap.
Original abstract
Recent research shows that LLM Agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates whether prompt-based LLM agents can accurately simulate multi-turn human shopping behavior, using a dataset of 31,865 real sessions with 230,965 actions as ground truth. It reports that models such as DeepSeek-R1, Llama, and Claude achieve only 11.86% action-generation accuracy, while fine-tuning Qwen2.5-7B on click-through data augmented with synthesized reasoning traces raises accuracy to 17.26% and purchase-prediction F1 to 33.86%. The work presents these results as the first large-scale quantitative benchmark for behavioral simulation fidelity.
Significance. If the evaluation ensures equivalent state information for the LLM at each turn, the reported accuracy gap would indicate that current prompt-only LLMs are substantially limited in replicating specific human decision sequences in interactive settings. The scale of the real-world dataset and the concrete improvement from fine-tuning constitute strengths that could guide future agent development. The result's broader impact on applications such as user modeling depends on confirming that the accuracy metric isolates behavioral modeling rather than information asymmetry.
major comments (2)
- [Evaluation Setup] The central claim of 11.86% accuracy in the abstract and results requires that the LLM receives the identical per-turn environment state (search results, product details, rankings, prices) that real users received. The description of the 31,865 sessions supplies user actions but does not confirm explicit replay of system responses; without state parity, the accuracy deficit could reflect missing context rather than inability to model human decision processes.
- [Results] The abstract and results report concrete accuracy (11.86%) and F1 (33.86%) numbers but supply no definition of how an action is counted as correct, no error bars, and no description of data filtering or statistical tests. This leaves the numerical claims difficult to evaluate and compare across models or conditions.
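The state-parity concern in the first major comment can be made concrete. The sketch below shows one way a replay-style evaluation could work, assuming each logged turn stores both the page the user saw and the action the user took. The `Turn` record, the `Policy` signature, and the `always_click_first` stand-in are hypothetical and only make the example runnable. They do not describe the paper's actual harness.

```python
from typing import Callable, NamedTuple, Sequence


class Turn(NamedTuple):
    """One logged step: what the user saw, then what the user did."""
    state: str   # logged page content (results, prices, rankings)
    action: str  # logged user action taken from that state


# A policy maps (current state, true action history) to a predicted action.
Policy = Callable[[str, Sequence[str]], str]


def replay_session(session: Sequence[Turn], policy: Policy) -> list[bool]:
    """Score a policy turn by turn while replaying the logged states."""
    history: list[str] = []
    hits: list[bool] = []
    for turn in session:
        predicted = policy(turn.state, tuple(history))
        hits.append(predicted == turn.action)
        # Advance with the logged action, not the prediction, so the next
        # state shown to the policy is the one the human actually saw.
        history.append(turn.action)
    return hits


def always_click_first(state: str, history: Sequence[str]) -> str:
    """Trivial stand-in for an LLM call, used only to make this runnable."""
    return "click:item_1"


# Toy session, invented for illustration only.
session = [
    Turn("results for 'kettle': item_1, item_2", "click:item_1"),
    Turn("item_1 detail page, $24.99", "back"),
    Turn("results for 'kettle': item_1, item_2", "click:item_2"),
]
hits = replay_session(session, always_click_first)
print(hits)
```

Feeding the logged action forward keeps the model on the human's trajectory, so each miss reflects a wrong decision from the same information rather than drift caused by an earlier error.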
minor comments (2)
- [Introduction] The related-work discussion could more explicitly contrast the quantitative action-matching metric with prior qualitative believability ratings to clarify the novelty of the benchmark.
- [Figures and Tables] Figure captions and table headers should include the exact definition of 'action accuracy' used in the reported percentages.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to improve clarity on the evaluation setup and results reporting.
Point-by-point responses
- Referee: [Evaluation Setup] The central claim of 11.86% accuracy in the abstract and results requires that the LLM receives the identical per-turn environment state (search results, product details, rankings, prices) that real users received. The description of the 31,865 sessions supplies user actions but does not confirm explicit replay of system responses; without state parity, the accuracy deficit could reflect missing context rather than inability to model human decision processes.
  Authors: We agree that confirming state parity is essential for interpreting the accuracy results. Our evaluation does replay the exact system responses (search results, product details, rankings, and prices) from the real sessions to the LLM at each turn, using the logged environment states as fixed inputs based on the action sequence. This ensures the model has equivalent information to the human users. We acknowledge the original description was insufficiently explicit on this point and have revised the Evaluation Setup section to state this replay mechanism clearly.
  Revision: yes
- Referee: [Results] The abstract and results report concrete accuracy (11.86%) and F1 (33.86%) numbers but supply no definition of how an action is counted as correct, no error bars, and no description of data filtering or statistical tests. This leaves the numerical claims difficult to evaluate and compare across models or conditions.
  Authors: We appreciate this observation on the need for greater transparency in reporting. In the revised manuscript, we have added: a precise definition of action correctness (exact match on action type and target, such as specific product ID or search term); error bars computed as standard error across sessions; data filtering criteria (e.g., sessions with complete logs and minimum length); and statistical tests (paired t-tests for model comparisons). These details appear in the Results section.
  Revision: yes
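The reporting additions described in the second response (standard error across sessions, paired t-tests) could be computed along the following lines. This is a sketch using only the Python standard library. It assumes accuracy is first computed per session and then aggregated, which the rebuttal does not spell out. The function names and the per-session numbers are invented for illustration.

```python
import math
import statistics
from typing import Sequence


def mean_and_standard_error(values: Sequence[float]) -> tuple[float, float]:
    """Mean and standard error (sample stdev / sqrt(n)) of per-session scores."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two sessions")
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(n)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff, se_diff = mean_and_standard_error(diffs)
    if se_diff == 0:
        raise ValueError("differences are constant; t is undefined")
    return mean_diff / se_diff


# Toy per-session accuracies for two models on the same sessions (invented).
finetuned = [0.25, 0.50, 0.00, 0.25, 0.50]
prompted = [0.00, 0.25, 0.00, 0.25, 0.25]

mean_acc, se_acc = mean_and_standard_error(finetuned)
t_stat = paired_t_statistic(finetuned, prompted)
degrees_of_freedom = len(finetuned) - 1
print(f"accuracy = {mean_acc:.3f} +/- {se_acc:.3f}, "
      f"t({degrees_of_freedom}) = {t_stat:.3f}")
```

Pairing by session is the natural design here because both models are scored on the same logged sessions. A p-value would additionally require the t distribution with n minus 1 degrees of freedom, which the standard library does not provide.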
Circularity Check
No circularity detected; results are direct empirical measurements against external data
full rationale
The paper's core results are accuracy metrics (11.86% for prompt-based LLMs, 17.26% for fine-tuned models) obtained by comparing generated actions to held-out real user actions from 31,865 sessions. No step reduces a claimed prediction to a fitted parameter, self-definition, or self-citation chain; the evaluation uses independent ground-truth recordings as the benchmark. Fine-tuning improvements are reported as straightforward empirical deltas on separate data splits. The derivation chain is self-contained and does not rely on any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 31,865 online shopping sessions accurately represent typical multi-turn human behavior in e-commerce settings.
Forward citations
Cited by 5 Pith papers
- ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
  ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of ag...
- SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
  SimPersona uses VQ-VAE to induce discrete buyer types from clickstreams, maps them to LLM persona tokens, and fine-tunes agents to achieve 78% conversion-rate alignment with real buyers across 42 storefronts.
- SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
  SimPersona induces a discrete buyer-type space from clickstreams via VQ-VAE, maps types to LLM persona tokens, fine-tunes agents on traces, and samples from merchant distributions to achieve 78% conversion-rate alignm...
- Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
  In real human subjects, AI transparency impacts imperfectly cooperative interactions far more than personality traits, unlike simulations where both are comparably influential.
- Model-Free Assessment of Simulator Fidelity via Quantile Curves
  A model-free method builds confidence sets for latent parameters to proxy sim-to-real discrepancies and estimates the quantile function of that proxy to produce a distribution-level fidelity profile for simulators.