
arxiv: 2503.20749 · v8 · submitted 2025-03-26 · 💻 cs.CL

Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data

Pith reviewed 2026-05-22 22:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agents · human behavior simulation · online shopping · multi-turn interaction · action generation accuracy · fine-tuning · behavioral fidelity · purchase prediction

The pith

Prompt-based LLMs generate real human shopping actions with only 11.86 percent accuracy across 31,865 sessions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether current LLM agents can produce step-by-step actions that match specific human users during extended shopping interactions. Using records of 230,965 real actions from 31,865 sessions, it measures how often model outputs align with what people actually did. Prompt-only versions of models such as DeepSeek-R1, Llama, and Claude reach just 11.86 percent accuracy. Fine-tuning Qwen2.5-7B on real click data plus synthesized reasoning traces raises action accuracy to 17.26 percent and purchase prediction F1 to 33.86 percent. These numbers establish a quantitative benchmark that separates qualitative believability from measurable behavioral fidelity.

Core claim

Prompt-based LLMs achieve only 11.86 percent accuracy in generating human actions that match recorded behavior in real online shopping sessions, while fine-tuning on real human click-through data augmented with synthesized reasoning traces improves the fine-tuned Qwen2.5-7B model to 17.26 percent action generation accuracy and 33.86 percent F1 score on final purchase prediction.
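
For readers unfamiliar with the F1 score quoted for purchase prediction, the following is a minimal sketch of how it is conventionally computed. It assumes purchase prediction is a binary label per session (bought or did not buy), which the abstract does not spell out; the function name and the sample labels are illustrative and not taken from the paper.

```python
def f1_score(predicted: list[bool], actual: list[bool]) -> float:
    """Harmonic mean of precision and recall for a binary 'purchase' label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have one label per session")
    tp = sum(p and a for p, a in zip(predicted, actual))        # predicted purchase, did purchase
    fp = sum(p and not a for p, a in zip(predicted, actual))    # predicted purchase, did not
    fn = sum(a and not p for p, a in zip(predicted, actual))    # missed a real purchase
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for five sessions (assumption: one binary label per session).
predicted_purchase = [True, True, False, False, True]
actual_purchase = [True, False, False, True, True]
purchase_f1 = f1_score(predicted_purchase, actual_purchase)
```

F1 is usually preferred over plain accuracy when one class is rare, because a model that never predicts a purchase scores zero here.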

What carries the argument

The accuracy metric that counts exact matches between model-generated actions and the 230,965 recorded user actions across 31,865 shopping sessions.
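
The exact-match metric can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: it assumes an action is a pair of action type and target, and the `Action` type, its field names, and the sample actions are hypothetical.

```python
from typing import NamedTuple


class Action(NamedTuple):
    kind: str    # hypothetical action type, e.g. "search", "click", "purchase"
    target: str  # hypothetical target, e.g. a search term or product ID


def action_accuracy(predicted: list[Action], logged: list[Action]) -> float:
    """Fraction of steps where the predicted action equals the logged one exactly."""
    if len(predicted) != len(logged):
        raise ValueError("predicted and logged must have one action per step")
    if not logged:
        return 0.0
    # NamedTuple equality compares both fields, so type and target must both match.
    hits = sum(p == g for p, g in zip(predicted, logged))
    return hits / len(logged)


# Made-up three-step session: the model gets the second click wrong.
logged_actions = [
    Action("search", "running shoes"),
    Action("click", "B001"),
    Action("purchase", "B001"),
]
predicted_actions = [
    Action("search", "running shoes"),
    Action("click", "B002"),
    Action("purchase", "B001"),
]
score = action_accuracy(predicted_actions, logged_actions)
```

Under exact matching, a click on a near-identical product counts as a miss, which is one reason accuracies under such a metric can look low.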

If this is right

  • Fine-tuning on real click data with added reasoning traces produces measurable gains in both action accuracy and purchase prediction.
  • The reported improvements over prompt baselines (5.4 points in action accuracy, from 11.86 to 17.26 percent, and 13.85 in purchase-prediction F1) show that task-specific training data can narrow the simulation gap.
  • The 31,865-session dataset supplies the first large-scale quantitative benchmark for multi-turn human behavior simulation.
  • Downstream applications that rely on accurate step-by-step human modeling will need methods beyond zero-shot prompting.
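
The improvement figures can be checked with simple arithmetic. The sketch below uses only the four numbers reported in the abstract. It assumes the 13.85 F1 improvement is an absolute difference like the accuracy gain; the implied baseline F1 is derived under that assumption and is not stated in the source.

```python
# Figures reported in the abstract, in percent.
prompt_accuracy = 11.86
tuned_accuracy = 17.26
tuned_f1 = 33.86
reported_f1_gain = 13.85

# Absolute gain in percentage points: matches the reported 5.4.
absolute_gain = round(tuned_accuracy - prompt_accuracy, 2)

# The same gain expressed relative to the prompt-only baseline.
relative_gain = round(100 * (tuned_accuracy - prompt_accuracy) / prompt_accuracy, 2)

# Assumption: the F1 gain is also absolute, which implies this prompt-only F1.
implied_baseline_f1 = round(tuned_f1 - reported_f1_gain, 2)

print(f"accuracy gain: {absolute_gain} points ({relative_gain}% relative)")
print(f"implied prompt-only F1: {implied_baseline_f1}")
```

The distinction matters when comparing against other work: a gain of about five points is also a relative gain of roughly 45 percent over the baseline.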

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Shopping may expose weaknesses in current LLMs that other multi-turn tasks such as dialogue or planning do not reveal as clearly.
  • Models that reach higher action accuracy could enable more reliable synthetic user testing for e-commerce interfaces.
  • The remaining gap after fine-tuning suggests that richer state representations or explicit memory mechanisms may still be required.

Load-bearing premise

The logged actions in the collected shopping sessions form a complete and unbiased record of what users would have done under the same conditions.

What would settle it

A replication experiment on the same 31,865 sessions in which any prompt-only LLM reaches above 30 percent exact action match would falsify the reported gap.

read the original abstract

Recent research shows that LLM Agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates whether prompt-based LLM agents can accurately simulate multi-turn human shopping behavior, using a dataset of 31,865 real sessions with 230,965 actions as ground truth. It reports that models such as DeepSeek-R1, Llama, and Claude achieve only 11.86% action-generation accuracy, while fine-tuning Qwen2.5-7B on click-through data augmented with synthesized reasoning traces raises accuracy to 17.26% and purchase-prediction F1 to 33.86%. The work presents these results as the first large-scale quantitative benchmark for behavioral simulation fidelity.

Significance. If the evaluation ensures equivalent state information for the LLM at each turn, the reported accuracy gap would indicate that current prompt-only LLMs are substantially limited in replicating specific human decision sequences in interactive settings. The scale of the real-world dataset and the concrete improvement from fine-tuning constitute strengths that could guide future agent development. The result's broader impact on applications such as user modeling depends on confirming that the accuracy metric isolates behavioral modeling rather than information asymmetry.
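
One way to read "equivalent state information at each turn" is a replay-style evaluation, sketched below. This illustrates the idea only and is not the paper's confirmed setup: the step format, the `Policy` signature, and the toy policy are all hypothetical. At each turn the model sees the logged state and the user's true prior actions, so an early wrong prediction does not change what it is shown later.

```python
from typing import Callable

# (logged_state, logged_action): hypothetical string encodings of one turn.
Step = tuple[str, str]
# A policy maps the current logged state and the true prior actions to a predicted action.
Policy = Callable[[str, list[str]], str]


def replay_session(steps: list[Step], policy: Policy) -> tuple[int, int]:
    """Replay one logged session and return (exact matches, number of turns)."""
    history: list[str] = []
    hits = 0
    for state, logged_action in steps:
        predicted = policy(state, history)
        hits += int(predicted == logged_action)
        # Append the logged action, not the prediction, so later turns stay on the real path.
        history.append(logged_action)
    return hits, len(steps)


def always_click_first(state: str, history: list[str]) -> str:
    """Toy stand-in for an LLM agent; always clicks the first item."""
    return "click:item_1"


# Made-up two-turn session.
session = [
    ("results for 'lamp'", "click:item_1"),
    ("item_1 detail page", "purchase:item_1"),
]
hits, total = replay_session(session, always_click_first)
```

Conditioning on the logged history isolates per-turn decision quality. A free-running rollout that feeds the model's own actions back in would measure something different and would need a live environment.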

major comments (2)
  1. [Evaluation Setup] The central claim of 11.86% accuracy in the abstract and results requires that the LLM receives the identical per-turn environment state (search results, product details, rankings, prices) that real users received. The description of the 31,865 sessions supplies user actions but does not confirm explicit replay of system responses; without state parity, the accuracy deficit could reflect missing context rather than inability to model human decision processes.
  2. [Results] The abstract and results report concrete accuracy (11.86%) and F1 (33.86%) numbers but supply no definition of how an action is counted as correct, no error bars, and no description of data filtering or statistical tests. This leaves the numerical claims difficult to evaluate and compare across models or conditions.
minor comments (2)
  1. [Introduction] The related-work discussion could more explicitly contrast the quantitative action-matching metric with prior qualitative believability ratings to clarify the novelty of the benchmark.
  2. [Figures and Tables] Figure captions and table headers should include the exact definition of 'action accuracy' used in the reported percentages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to improve clarity on the evaluation setup and results reporting.

read point-by-point responses
  1. Referee: [Evaluation Setup] The central claim of 11.86% accuracy in the abstract and results requires that the LLM receives the identical per-turn environment state (search results, product details, rankings, prices) that real users received. The description of the 31,865 sessions supplies user actions but does not confirm explicit replay of system responses; without state parity, the accuracy deficit could reflect missing context rather than inability to model human decision processes.

    Authors: We agree that confirming state parity is essential for interpreting the accuracy results. Our evaluation does replay the exact system responses (search results, product details, rankings, and prices) from the real sessions to the LLM at each turn, using the logged environment states as fixed inputs based on the action sequence. This ensures the model has equivalent information to the human users. We acknowledge the original description was insufficiently explicit on this point and have revised the Evaluation Setup section to state this replay mechanism clearly. revision: yes

  2. Referee: [Results] The abstract and results report concrete accuracy (11.86%) and F1 (33.86%) numbers but supply no definition of how an action is counted as correct, no error bars, and no description of data filtering or statistical tests. This leaves the numerical claims difficult to evaluate and compare across models or conditions.

    Authors: We appreciate this observation on the need for greater transparency in reporting. In the revised manuscript, we have added: a precise definition of action correctness (exact match on action type and target, such as specific product ID or search term); error bars computed as standard error across sessions; data filtering criteria (e.g., sessions with complete logs and minimum length); and statistical tests (paired t-tests for model comparisons). These details are now included in the Results section. revision: yes
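
The error bars and paired test named in the response above can be sketched with the Python standard library. The per-session accuracies below are made up for illustration and are not results from the paper; the function names are hypothetical. The sketch returns only the t statistic, so a p-value would still require a t distribution with n - 1 degrees of freedom.

```python
import math
import statistics


def standard_error(values: list[float]) -> float:
    """Sample standard deviation divided by sqrt(n); needs at least two values."""
    return statistics.stdev(values) / math.sqrt(len(values))


def paired_t_statistic(a: list[float], b: list[float]) -> float:
    """t statistic for the mean per-session difference a - b (paired samples)."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / standard_error(diffs)


# Made-up per-session accuracies for two models on the same four sessions.
prompt_only = [0.10, 0.12, 0.08, 0.14]
fine_tuned = [0.15, 0.18, 0.12, 0.21]

se_prompt = standard_error(prompt_only)
t_stat = paired_t_statistic(fine_tuned, prompt_only)
print(f"prompt-only mean {statistics.mean(prompt_only):.3f} ± {se_prompt:.3f} (SE)")
print(f"paired t statistic: {t_stat:.2f}")
```

Pairing by session removes between-session variation, such as some sessions being inherently harder to predict, from the comparison.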

Circularity Check

0 steps flagged

No circularity detected; results are direct empirical measurements against external data

full rationale

The paper's core results are accuracy metrics (11.86% for prompt-based LLMs, 17.26% for fine-tuned models) obtained by comparing generated actions to held-out real user actions from 31,865 sessions. No step reduces a claimed prediction to a fitted parameter, self-definition, or self-citation chain; the evaluation uses independent ground-truth recordings as the benchmark. Fine-tuning improvements are reported as straightforward empirical deltas on separate data splits. The derivation chain is self-contained and does not rely on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the collected shopping sessions provide an objective ground truth for behavioral accuracy; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 31,865 online shopping sessions accurately represent typical multi-turn human behavior in e-commerce settings.
    The evaluation treats the recorded actions as the reference standard for measuring simulation accuracy.

pith-pipeline@v0.9.0 · 5833 in / 1259 out tokens · 64426 ms · 2026-05-22T22:10:21.602259+00:00 · methodology


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

    cs.AI 2026-05 conditional novelty 7.0

    ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of ag...

  2. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

    cs.AI 2026-05 conditional novelty 7.0

    SimPersona uses VQ-VAE to induce discrete buyer types from clickstreams, maps them to LLM persona tokens, and fine-tunes agents to achieve 78% conversion-rate alignment with real buyers across 42 storefronts.

  3. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SimPersona induces a discrete buyer-type space from clickstreams via VQ-VAE, maps types to LLM persona tokens, fine-tunes agents on traces, and samples from merchant distributions to achieve 78% conversion-rate alignm...

  4. Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies

    cs.CL 2026-04 unverdicted novelty 5.0

    In real human subjects, AI transparency impacts imperfectly cooperative interactions far more than personality traits, unlike simulations where both are comparably influential.

  5. Model-Free Assessment of Simulator Fidelity via Quantile Curves

    stat.ME 2025-12 unverdicted novelty 5.0

    A model-free method builds confidence sets for latent parameters to proxy sim-to-real discrepancies and estimates the quantile function of that proxy to produce a distribution-level fidelity profile for simulators.