OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
Pith reviewed 2026-05-22 00:00 UTC · model grok-4.3
The pith
The OPERA dataset captures real users' personas, observations, rationales, and actions in online shopping to benchmark how well LLMs can predict the next step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OPERA is a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. It is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. Using OPERA, the authors establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale, given that user's persona and observation-action-rationale history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.
What carries the argument
The OPERA dataset, built using an online questionnaire and a custom browser plugin, which records four components: user personas, browser observations, fine-grained web actions, and self-reported rationales.
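As a minimal sketch of how the four components might fit together in one record, assuming hypothetical field names and made-up example values (the source does not specify the dataset's actual schema):

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded step; all field names are illustrative assumptions."""
    observation: str  # e.g. a simplified rendering of the page the user saw
    action: str       # e.g. a click, scroll, or text input on that page
    rationale: str    # the user's self-reported reason for the action


@dataclass
class Session:
    """A shopping session: one persona plus an ordered list of steps."""
    persona: str
    steps: list[Step] = field(default_factory=list)


# Made-up values, purely to show the shape of a record.
session = Session(persona="budget-conscious student")
session.steps.append(
    Step(
        observation="search results for 'desk lamp'",
        action="click item 3",
        rationale="cheapest option with good reviews",
    )
)
```

Keeping the persona at the session level and the other three components at the step level mirrors the description above, where a persona belongs to the user and the remaining fields are recorded per action.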
If this is right
- Current LLMs can be tested for their ability to simulate personalized user behavior in web environments.
- Success on this benchmark would support developing LLM-based agents that function as digital twins for individual shoppers.
- The dataset enables research into aligning LLM outputs with human-like reasoning in decision-making tasks.
Where Pith is reading between the lines
- Extending this data collection method to other domains like social media or news consumption could reveal how well LLMs simulate behavior across different contexts.
- If the benchmark shows LLMs struggle with rationale prediction, it might highlight gaps in capturing human decision-making processes.
Load-bearing premise
The data collection via online questionnaire and custom browser plugin produces accurate, unbiased, and representative records of real human reasoning and observable behavior during online shopping sessions.
What would settle it
If running the benchmark showed that LLMs predict next actions and rationales from the provided history no better than simple baselines or random guessing, that result would undermine the dataset's utility for evaluating simulation capabilities.
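A comparison of this kind can be sketched as follows. This is an illustration only: it assumes exact-match accuracy on action labels and a majority-class baseline, neither of which is confirmed by the source as the benchmark's actual metric or baseline, and the action labels are toy values.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of positions where the predicted action equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(train_actions, n):
    """Always predict the most frequent action seen in the training data."""
    most_common_action = Counter(train_actions).most_common(1)[0][0]
    return [most_common_action] * n


# Toy, made-up action labels purely for illustration.
train = ["click", "click", "scroll", "type"]
actual = ["click", "scroll", "click", "type"]
model_predictions = ["click", "scroll", "type", "type"]  # stand-in for LLM output

baseline_acc = accuracy(majority_baseline(train, len(actual)), actual)
model_acc = accuracy(model_predictions, actual)
```

Under this setup, a model whose accuracy does not exceed `baseline_acc` would be the outcome described above.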
Original abstract
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for human.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the OPERA dataset of Observation, Persona, Rationale, and Action tuples collected from real human participants during online shopping sessions via an online questionnaire and custom browser plugin. It claims this is the first public dataset capturing these elements comprehensively and uses it to establish the first benchmark evaluating how well LLMs can predict a specific user's next action and rationale given persona and history.
Significance. If the dataset's rationales and actions accurately reflect concurrent human reasoning rather than post-hoc reconstructions, the work would provide a valuable new resource for benchmarking personalized LLM agents as digital twins in web environments, addressing a noted gap in high-quality behavioral datasets.
major comments (2)
- [Abstract] Abstract and data collection description: no sample size, validation steps, quality metrics, or example data are supplied. This leaves the central claim of high-fidelity capture of real human reasoning unsupported by verifiable evidence and directly affects interpretability of the proposed benchmark.
- [Data Collection] Data collection pipeline: rationales are gathered separately via post-action questionnaire, creating a temporal gap between action and self-report. No validation experiment (e.g., concurrent think-aloud agreement or plugin-log accuracy against screen recordings) is reported to confirm these are 'just-in-time' rather than post-hoc justifications, which is load-bearing for the claim that benchmark scores measure simulation fidelity to actual human decision processes.
minor comments (1)
- [Dataset Description] Clarify the exact format and granularity of the 'fine-grained web actions' and 'browser observations' fields so that benchmark reproducibility is unambiguous.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.
- Referee: [Abstract] Abstract and data collection description: no sample size, validation steps, quality metrics, or example data are supplied. This leaves the central claim of high-fidelity capture of real human reasoning unsupported by verifiable evidence and directly affects interpretability of the proposed benchmark.
  Authors: We agree that including these details in the abstract would improve the manuscript's clarity and support for our claims. In the revised version, we have updated the abstract to include the sample size (number of participants and shopping sessions), a summary of the data collection and quality control steps, key quality metrics such as inter-rater agreement or completion rates if applicable, and a brief example of an OPERA tuple. These elements were already detailed in the main text but are now highlighted in the abstract for better accessibility and to strengthen the interpretability of the benchmark results.
  revision: yes
- Referee: [Data Collection] Data collection pipeline: rationales are gathered separately via post-action questionnaire, creating a temporal gap between action and self-report. No validation experiment (e.g., concurrent think-aloud agreement or plugin-log accuracy against screen recordings) is reported to confirm these are 'just-in-time' rather than post-hoc justifications, which is load-bearing for the claim that benchmark scores measure simulation fidelity to actual human decision processes.
  Authors: We acknowledge the temporal separation inherent in our data collection method, where rationales are elicited through an immediate post-action questionnaire to capture reasoning as close as possible to the decision moment. We have revised the manuscript to more precisely describe the timing (questionnaire administered within seconds after each action via the plugin) and to explicitly discuss this as a limitation, noting that while designed to be just-in-time, it may involve some degree of post-hoc reconstruction. Regarding validation experiments, we did not include concurrent think-aloud protocols or direct comparisons with screen recordings in the original study design. We have added a section discussing potential impacts on fidelity and suggest this as an avenue for future work to further validate the dataset.
  revision: partial
- Empirical validation of the 'just-in-time' nature of rationales through concurrent think-aloud protocols or accuracy checks against screen recordings was not performed in the original data collection and thus cannot be provided without additional experiments.
Circularity Check
No circularity: dataset introduction with no derivation chain
The manuscript presents an observational data-collection effort: participants complete an online questionnaire and use a custom browser plugin to record personas, observations, actions, and post-action rationales during shopping sessions. It then defines a benchmark by splitting the collected tuples into history and prediction targets. No equations, fitted parameters, first-principles derivations, or load-bearing self-citations appear. The central claim is simply that the authors gathered and released this particular dataset and evaluation protocol; that claim is true by construction of the collection process itself and does not reduce to any hidden circular step. External validity concerns (e.g., post-hoc rationales) are separate from circularity and are not addressed here.
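The split into history and prediction targets can be sketched as below. This is a hedged illustration, not the authors' protocol: the function name, the dictionary keys, and the choice to pass the current observation as an input are all assumptions.

```python
def make_examples(persona, steps):
    """Turn one session into next-step prediction examples.

    `steps` is an ordered list of (observation, action, rationale) tuples.
    For step t, the history is every step before t, and the targets are
    step t's action and rationale. Passing step t's observation as an
    input is an assumption of this sketch.
    """
    examples = []
    for t, (observation, action, rationale) in enumerate(steps):
        examples.append({
            "persona": persona,
            "history": list(steps[:t]),
            "observation": observation,
            "target_action": action,
            "target_rationale": rationale,
        })
    return examples


# Placeholder values standing in for real session data.
examples = make_examples(
    "p",
    [("o1", "a1", "r1"), ("o2", "a2", "r2")],
)
```

Each session of n steps yields n examples, with the first example having an empty history.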
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Self-reported rationales accurately reflect users' internal reasoning processes at the time of action.
Forward citations
Cited by 4 Pith papers
- ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
  ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of ag...
- SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
  SimPersona uses VQ-VAE to induce discrete buyer types from clickstreams, maps them to LLM persona tokens, and fine-tunes agents to achieve 78% conversion-rate alignment with real buyers across 42 storefronts.
- SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents
  SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.
- SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
  SimPersona induces a discrete buyer-type space from clickstreams via VQ-VAE, maps types to LLM persona tokens, fine-tunes agents on traces, and samples from merchant distributions to achieve 78% conversion-rate alignm...