OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Amirali Amini; Bo Sun; Dakuo Wang; Jing Huang; Jiri Gesi; Lydia Chilton; Malihe Alikhani; Tian Wang; Toby Jia-Jun Li; Upol Ehsan

arxiv: 2506.05606 · v7 · pith:FRKQKJMGnew · submitted 2025-06-05 · 💻 cs.CL · cs.HC

Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

Pith reviewed 2026-05-22 00:00 UTC · model grok-4.3

keywords: OPERA dataset · LLM simulation · online shopping behavior · user persona · action prediction · rationale · benchmark dataset · digital twins

The pith

The OPERA dataset captures real users' personas, observations, rationales, and actions in online shopping to benchmark how well LLMs can predict the next step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OPERA as the first public dataset that records user personas, what users see in their browser, the fine-grained actions they take, and their immediate rationales during actual online shopping. It also sets up a benchmark for testing LLMs on predicting a specific user's next action and rationale given the persona and past history. A sympathetic reader would care because this provides a way to measure whether LLMs can act as personalized digital twins that mimic individual human behavior in e-commerce settings, rather than merely producing generic responses.

Core claim

OPERA is a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. It is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. Using OPERA, the authors establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale with a given persona and observation-action-rationale history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.

What carries the argument

The OPERA dataset, built using an online questionnaire and a custom browser plugin, which records four components: user personas, browser observations, fine-grained web actions, and self-reported rationales.
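
As a rough illustration of how the four components fit together, one session could be modelled as below. This is only a sketch: the class and field names (`Persona`, `Step`, `Session`, `describe`) and the example action strings are assumptions made for illustration, not the dataset's released schema.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    # Hypothetical field: free-text profile derived from the questionnaire.
    description: str


@dataclass
class Step:
    # Hypothetical fields; the released dataset may structure these differently.
    observation: str  # what the browser showed at this point
    action: str       # fine-grained web action, e.g. "click:item_3"
    rationale: str    # self-reported reason for the action


@dataclass
class Session:
    persona: Persona
    steps: list[Step] = field(default_factory=list)


def describe(session: Session) -> str:
    """Summarise a session as 'persona | action1 -> action2 -> ...'."""
    actions = " -> ".join(step.action for step in session.steps)
    return f"{session.persona.description} | {actions}"
```

Keeping the persona at the session level and the other three components at the step level mirrors the description above, where the persona comes from the questionnaire and the rest is recorded during browsing.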

If this is right

Current LLMs can be tested for their ability to simulate personalized user behavior in web environments.
Success on this benchmark would support developing LLM-based agents that function as digital twins for individual shoppers.
The dataset enables research into aligning LLM outputs with human-like reasoning in decision-making tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending this data collection method to other domains like social media or news consumption could reveal how well LLMs simulate behavior across different contexts.
If the benchmark shows LLMs struggle with rationale prediction, it might highlight gaps in capturing human decision-making processes.

Load-bearing premise

The data collection via online questionnaire and custom browser plugin produces accurate, unbiased, and representative records of real human reasoning and observable behavior during online shopping sessions.

What would settle it

Running the benchmark and finding that LLMs perform no better than simple baselines or random guessing in predicting next actions and rationales from the provided history would falsify the utility of the dataset for evaluating simulation capabilities.
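
The comparison against a simple baseline could be sketched as follows. This is a minimal illustration that assumes actions are reduced to discrete labels and scored by exact match; the paper's actual metrics may differ, and the function names (`majority_baseline`, `accuracy`, `beats_baseline`) are hypothetical.

```python
from collections import Counter


def majority_baseline(train_actions: list[str]) -> str:
    """Return the most frequent action label in the training data."""
    return Counter(train_actions).most_common(1)[0][0]


def accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of exact matches between predicted and true action labels."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def beats_baseline(
    model_preds: list[str], targets: list[str], train_actions: list[str]
) -> bool:
    """True if the model's accuracy strictly exceeds the majority baseline's."""
    baseline_preds = [majority_baseline(train_actions)] * len(targets)
    return accuracy(model_preds, targets) > accuracy(baseline_preds, targets)
```

A model that fails this check on held-out sessions would be doing no better than always guessing the most common action.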

Original abstract

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for human.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OPERA gives a new public dataset pairing real-user personas, browser observations, actions, and rationales for LLM shopping simulation benchmarks, but the rationale collection method lacks the validation needed to treat the data as faithful traces of concurrent reasoning.

The paper's core contribution is a dataset collected from actual shoppers that bundles persona details, live browser observations, fine-grained actions, and self-reported rationales into one public resource. This combination is not already available in the cited prior work, so it does open a concrete way to test whether LLMs can predict what a specific user will do next and why, given history. They built it with a questionnaire plus custom browser plugin, which is a practical way to gather the four elements together during real sessions, and they frame a benchmark around next-action-plus-rationale prediction. That setup is useful for anyone trying to build or evaluate personalized web agents in e-commerce settings. The data release itself is the main deliverable, and if the logs are clean and the participant pool is decent, it could serve as a starting point for follow-on experiments.

The weak point is the rationale capture. The abstract calls them just-in-time, yet the collection happens through a post-action questionnaire, which is the exact setup known to produce reconstructed justifications rather than real-time traces. No validation step is described: no think-aloud comparison, no screen-recording checks, no inter-rater agreement on rationale fidelity. Without that evidence the benchmark scores are hard to interpret as measures of simulation accuracy. Sample size, demographics, and basic quality metrics are also missing from the abstract, so it is difficult to judge how representative the records are.

This paper is aimed at researchers working on LLM agents for web interaction or human-behavior modeling. A reader who needs a ready-made dataset for that niche will find it worth downloading and inspecting. I would send it to peer review; the dataset idea is straightforward and the benchmark framing is clear, but referees will have to press on the data-quality controls before the claims about faithful human simulation can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The paper introduces the OPERA dataset of Observation, Persona, Rationale, and Action tuples collected from real human participants during online shopping sessions via an online questionnaire and custom browser plugin. It claims this is the first public dataset capturing these elements comprehensively and uses it to establish the first benchmark evaluating how well LLMs can predict a specific user's next action and rationale given persona and history.

Significance. If the dataset's rationales and actions accurately reflect concurrent human reasoning rather than post-hoc reconstructions, the work would provide a valuable new resource for benchmarking personalized LLM agents as digital twins in web environments, addressing a noted gap in high-quality behavioral datasets.

major comments (2)

[Abstract] Abstract and data collection description: no sample size, validation steps, quality metrics, or example data are supplied. This leaves the central claim of high-fidelity capture of real human reasoning unsupported by verifiable evidence and directly affects interpretability of the proposed benchmark.
[Data Collection] Data collection pipeline: rationales are gathered separately via post-action questionnaire, creating a temporal gap between action and self-report. No validation experiment (e.g., concurrent think-aloud agreement or plugin-log accuracy against screen recordings) is reported to confirm these are 'just-in-time' rather than post-hoc justifications, which is load-bearing for the claim that benchmark scores measure simulation fidelity to actual human decision processes.

minor comments (1)

[Dataset Description] Clarify the exact format and granularity of the 'fine-grained web actions' and 'browser observations' fields so that benchmark reproducibility is unambiguous.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

Referee: [Abstract] Abstract and data collection description: no sample size, validation steps, quality metrics, or example data are supplied. This leaves the central claim of high-fidelity capture of real human reasoning unsupported by verifiable evidence and directly affects interpretability of the proposed benchmark.

Authors: We agree that including these details in the abstract would improve the manuscript's clarity and support for our claims. In the revised version, we have updated the abstract to include the sample size (number of participants and shopping sessions), a summary of the data collection and quality control steps, key quality metrics such as inter-rater agreement or completion rates if applicable, and a brief example of an OPERA tuple. These elements were already detailed in the main text but are now highlighted in the abstract for better accessibility and to strengthen the interpretability of the benchmark results.

revision: yes

Referee: [Data Collection] Data collection pipeline: rationales are gathered separately via post-action questionnaire, creating a temporal gap between action and self-report. No validation experiment (e.g., concurrent think-aloud agreement or plugin-log accuracy against screen recordings) is reported to confirm these are 'just-in-time' rather than post-hoc justifications, which is load-bearing for the claim that benchmark scores measure simulation fidelity to actual human decision processes.

Authors: We acknowledge the temporal separation inherent in our data collection method, where rationales are elicited through an immediate post-action questionnaire to capture reasoning as close as possible to the decision moment. We have revised the manuscript to more precisely describe the timing (questionnaire administered within seconds after each action via the plugin) and to explicitly discuss this as a limitation, noting that while designed to be just-in-time, it may involve some degree of post-hoc reconstruction. Regarding validation experiments, we did not include concurrent think-aloud protocols or direct comparisons with screen recordings in the original study design. We have added a section discussing potential impacts on fidelity and suggest this as an avenue for future work to further validate the dataset.

revision: partial

standing simulated objections not resolved

Empirical validation of the 'just-in-time' nature of rationales through concurrent think-aloud protocols or accuracy checks against screen recordings was not performed in the original data collection and thus cannot be provided without additional experiments.

Circularity Check

0 steps flagged

No circularity: dataset introduction with no derivation chain

The manuscript presents an observational data-collection effort: participants complete an online questionnaire and use a custom browser plugin to record personas, observations, actions, and post-action rationales during shopping sessions. It then defines a benchmark by splitting the collected tuples into history and prediction targets. No equations, fitted parameters, first-principles derivations, or load-bearing self-citations appear. The central claim is simply that the authors gathered and released this particular dataset and evaluation protocol; that claim is true by construction of the collection process itself and does not reduce to any hidden circular step. External validity concerns (e.g., post-hoc rationales) are separate from circularity and are not addressed here.
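
The split into history and prediction targets described above might look like the following sketch. It assumes one prediction example per step and that the current observation is shown to the model alongside the history; both are illustrative assumptions, as are the names `Step`, `Example`, and `make_examples`.

```python
from typing import NamedTuple


class Step(NamedTuple):
    # Hypothetical record layout for one step of a shopping session.
    observation: str
    action: str
    rationale: str


class Example(NamedTuple):
    history: tuple[Step, ...]  # completed steps before the prediction point
    observation: str           # current observation given to the model
    target_action: str         # action the model should predict
    target_rationale: str      # rationale the model should predict


def make_examples(steps: list[Step]) -> list[Example]:
    """Turn one session into one prediction example per step."""
    return [
        Example(tuple(steps[:i]), step.observation, step.action, step.rationale)
        for i, step in enumerate(steps)
    ]
```

Because every target is a step the participant actually took, the examples are determined entirely by the recorded session, which is why no circular step arises.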

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that self-reported rationales and plugin-captured observations faithfully represent internal user reasoning and behavior; no free parameters or invented entities are introduced because the contribution is data collection rather than modeling.

axioms (1)

Domain assumption: Self-reported rationales accurately reflect users' internal reasoning processes at the time of action.
The dataset construction depends on participants providing honest and precise explanations immediately after each web action.

pith-pipeline@v0.9.0 · 5791 in / 1300 out tokens · 54552 ms · 2026-05-22T00:00:17.497555+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
cs.AI 2026-05 conditional novelty 7.0

ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of ag...

SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
cs.AI 2026-05 conditional novelty 7.0

SimPersona uses VQ-VAE to induce discrete buyer types from clickstreams, maps them to LLM persona tokens, and fine-tunes agents to achieve 78% conversion-rate alignment with real buyers across 42 storefronts.

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.

SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
cs.AI 2026-05 unverdicted novelty 6.0

SimPersona induces a discrete buyer-type space from clickstreams via VQ-VAE, maps types to LLM persona tokens, fine-tunes agents on traces, and samples from merchant distributions to achieve 78% conversion-rate alignm...