
arxiv: 2604.08362 · v2 · pith:ZSEJRDVMnew · submitted 2026-04-09 · 💻 cs.CL · cs.AI · cs.LG

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Pith reviewed 2026-05-22 10:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords LLM user simulation · behavior benchmark · structural bias · persona homogenization · long-horizon traces · real-world data · utopian bias

The pith

LLMs simulating real human behavior converge toward a positive average person and erase individual differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds OmniBehavior, a benchmark drawn entirely from real-world traces, to test how well large language models can act as user simulators across long sequences that cross multiple life scenarios. It first shows that prior benchmarks using isolated or synthetic settings miss the causal chains that link decisions over time in actual human lives. When state-of-the-art models are evaluated on the new benchmark, they produce behaviors that are more active, more uniform across people, and more optimistic than the source data. The resulting structural bias removes the variability and infrequent patterns that define real individuals, limiting how faithfully any downstream application can replay or predict human actions.

Core claim

A systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors.

What carries the argument

The OmniBehavior benchmark, which assembles long-horizon, cross-scenario, and heterogeneous behavioral patterns directly from real-world data to serve as ground truth for simulation fidelity.

If this is right

  • Isolated-scenario datasets create tunnel vision that hides the cross-scenario causal chains present in real decision-making.
  • LLM simulation performance plateaus even when context windows are enlarged.
  • The structural bias produces outputs that systematically omit the low-frequency behaviors observed in authentic traces.
  • High-fidelity simulation will require explicit mechanisms to preserve individual differences rather than defaulting to an averaged persona.
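
One way to check whether individual differences survive simulation is to compare within-user and between-user behavioral distances, in the spirit of the intra-user vs. inter-user comparison described for Figure 12. The sketch below is illustrative only: the function name `intra_inter`, the representation of each user as a list of numeric behavior vectors, the Euclidean distance, and the toy data are all assumptions, not the paper's actual metric or data.

```python
from itertools import combinations
from math import dist
from statistics import mean


def intra_inter(users):
    """Return (mean intra-user distance, mean inter-user distance).

    `users` maps a user id to a list of behavior vectors (hypothetical
    representation; the paper's feature space is not specified here).
    """
    # Distances between pairs of vectors belonging to the same user.
    intra = [dist(a, b)
             for vecs in users.values()
             for a, b in combinations(vecs, 2)]
    # Distances between vectors belonging to two different users.
    inter = [dist(a, b)
             for (_, va), (_, vb) in combinations(users.items(), 2)
             for a in va for b in vb]
    return mean(intra), mean(inter)


# Toy data, invented for illustration only.
human = {"u1": [(0.0, 0.0), (0.0, 1.0)], "u2": [(10.0, 0.0), (10.0, 1.0)]}
simulated = {"u1": [(0.0, 0.0), (0.0, 1.0)], "u2": [(0.0, 0.0), (0.0, 1.0)]}

human_intra, human_inter = intra_inter(human)
sim_intra, sim_inter = intra_inter(simulated)
print(f"human:     intra={human_intra:.2f} inter={human_inter:.2f}")
print(f"simulated: intra={sim_intra:.2f} inter={sim_inter:.2f}")
```

In this toy setup the human users are far apart relative to their own variation, while the simulated users are indistinguishable from each other, which is the pattern the homogenization claim describes.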

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same convergence may appear in other generative tasks that rely on modeling user preferences or sequences, such as personalized recommendation or dialogue systems.
  • A practical extension would be to add explicit regularization or retrieval steps that force models to reproduce measured frequencies of rare actions from the source traces.
  • Testing whether the bias persists when models are given explicit negative or low-activity examples from the same data would clarify whether the issue is data scarcity or architectural.
  • If the homogenization is confirmed across multiple languages or cultures, it would indicate a training-data skew rather than a language-specific artifact.

Load-bearing premise

The real-world behavioral traces collected for OmniBehavior accurately and representatively capture authentic long-horizon, cross-scenario human decision-making without significant selection or measurement biases.

What would settle it

Collect a fresh set of long-horizon traces from a demographically different population that explicitly includes documented long-tail decisions, then measure whether LLM outputs still flatten those decisions into positive averages.

Figures

Figures reproduced from arXiv: 2604.08362 by Boxi Cao, Hongyu Lin, Jiawei Chen, Le Sun, Ruotong Pan, Ruoxi Xu, Tingting Gao, Xiangyu Wu, Xianpei Han, Yaojie Lu, Yifei Hu, Yingfei Sun, Yong Du, Yunfei Zhang.

Figure 1: Overview of OmniBehavior, a real-world comprehensive benchmark for evaluating LLM-based user simulators. The benchmark is constructed in three stages: (1) Data Collection: aggregation of real-world logs from the Kuaishou platform across several major scenarios, with Customer Service treated as part of the E-commerce scenario, yielding five scenarios in total after aggregation. (2) Data Processing: multi-mo…
Figure 2: User profile reconstruction based on Single-scenario vs. Multi-scenario data.
Figure 3: Cumulative interest coverage with increasing scenarios. Determining how to construct a dataset that fully captures the decision-making process requires a deep understanding of user causal chains. To this end, we sample 180 high-value conversion events (e.g., “Purchase”) and trace the full historical interaction paths leading to final outcome using Claude Sonnet-4.5 [5]. All traced paths are further manua…
Figure 4: Distributions of causal chain spans. (a) …
Figure 5: Case study of a cross-scenario causal chain, in which a search-initiated interest in “Xiaomi” …
Figure 6: Our real dataset (OmniBehavior) interests evolve smoothly, while synthetic dataset (LoCoMo) interests show rigid, task-driven spikes. To investigate whether synthetic user simulation data can accurately reproduce the complex dynamics of real user interest evolution, we conduct a comparative analysis between authentic and synthetic user trajectories. Specifically, we compare OmniBehavior with LoCoMo [28…
Figure 7: Effect of Context Window Size. To evaluate LLMs’ capability to model long interaction histories, we test representative open-source and closed-source LLMs across varying context window sizes from 16K to 128K tokens. The experiments are conducted on a specific user subset (N = 66) with interaction histories exceeding 128K tokens to ensure sufficient data for scaling analysis. As shown in …
Figure 8: Effect of Memory Management. To further examine whether commonly used context management mechanisms can alleviate the above limitations, we compare the performance of two representative memory management approaches based on Qwen3-235B. Both approaches operate on the full user interaction history. For summarization-based method, we periodically summarize the history whenever it reaches a 4k-token buffer. …
Figure 9: Comparison of positive interaction rates between real users and LLM-based simulators across scenarios. LLM-generated behaviors show substantially higher positive rates, revealing a systematic hyper-activity bias. We first compare real and simulated user behaviors at the distribution level by measuring the positive prediction rate, defined as the proportion of positive outcomes among all interactions. We …
Figure 10: Sentiment distribution of real users and …
Figure 11: Language style comparison between real users and LLM-simulated users. LLM-generated utterances exhibit higher levels of politeness markers, hedging, and face-saving strategies, indicating a systematic tendency towards overly polite and non-confrontational language. Language Style. Complementary analysis of language style further reveals systematic differences in how users express dissatisfaction. Follow…
Figure 12: Comparison of Intra-user and Inter-user behavioral distances for Human and LLM-simulated users. Real users exhibit significantly larger inter-user variation than intra-user variation, whereas LLM-generated users show heavily overlapping distributions, indicating a pronounced tendency toward persona homogenization. The above findings raise the question of whether LLM-based simulators preserve personalize…
Figure 13: Log-scaled distribution of user action sequence lengths, spanning hundreds to over 100k, requiring models to handle ultra-long contexts. To provide a clearer picture of OmniBehavior’s composition, we report additional statistics on user behavior sequences and population attributes
Figure 14: Demographic and behavioral distributions of users in the benchmark. The charts show …
Figure 15: The OmniBehavior Benchmark Scope. We construct a unified simulation environment …
Figure 16: Vocabulary comparison of real users (left) and LLM-based simulators (right). The LLM …
Figure 17: Intra-user vs. Inter-user behavioral distance distributions for all evaluated models.
Figure 18: Prompt for binary value prediction in the main experiment.
Figure 19: Prompt for extracting and understanding key information from live streaming cover …
Figure 20: Prompt for continuous value prediction in the main experiment.
Figure 21: Prompt for sentiment classification in Utopian Tendency experiment.
Figure 22: Prompt for text value prediction in the main experiment.
Figure 23: Prompt for identifying an item’s interest categories and keywords.
Figure 24: Prompt for raw data cleaning. You are a user behavior analysis expert focused on causal inference. Your task is to analyze a user’s historical interaction sequence, identify the key causal events that lead to a target behavior, and explain their roles. Input data includes the user history sequence and the target behavior. Task description: Review the entire history and identify key events that meaningfull…
Figure 25: Prompt for causal chain identification.
Figure 26: Prompt for language style comparison. E Case Study. To facilitate an intuitive understanding of our evaluation pipeline, we provide qualitative examples across multiple representative application scenarios. Specifically, we select four real-world settings: behavior prediction in live-streaming ( …
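
The positive prediction rate mentioned for Figure 9 is defined as the proportion of positive outcomes among all interactions. A minimal sketch of that calculation follows; the function name `positive_rate`, the boolean encoding of outcomes, and the toy numbers are assumptions made for illustration and are not taken from the paper's data.

```python
from typing import Iterable


def positive_rate(outcomes: Iterable[bool]) -> float:
    """Share of interactions with a positive outcome (e.g. a like or purchase)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one interaction")
    return sum(outcomes) / len(outcomes)


# Toy data, invented for illustration only.
real = [False] * 19 + [True]            # 1 positive out of 20 interactions
simulated = [True] * 8 + [False] * 12   # 8 positives out of 20 interactions

real_rate = positive_rate(real)
sim_rate = positive_rate(simulated)
# A positive gap means the simulator is more active than the real users.
hyper_activity_gap = sim_rate - real_rate
print(f"real={real_rate:.2f} simulated={sim_rate:.2f} gap={hyper_activity_gap:+.2f}")
```

Computing the rate per scenario and comparing real against simulated values would give a comparison of the kind the Figure 9 caption describes.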
Original abstract

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces OmniBehavior, a benchmark for LLM-based user simulation constructed entirely from real-world long-horizon, cross-scenario, and heterogeneous behavioral traces. It argues that existing isolated-scenario datasets suffer from tunnel vision compared to real-world causal chains, shows that state-of-the-art LLMs struggle to simulate these behaviors with performance plateauing despite larger context windows, and identifies a structural bias in LLMs toward a positive average person, manifested as hyper-activity, persona homogenization, and utopian bias that erases individual differences and long-tail behaviors.

Significance. If the empirical comparisons and bias findings hold after rigorous validation of the ground-truth data, this work would be significant for advancing user simulation research in NLP and HCI. It provides the first unified real-world benchmark beyond synthetic or narrow scenarios and surfaces concrete failure modes (homogenization, loss of long-tail events) that could guide mitigation strategies in generative behavior modeling.

major comments (2)
  1. [Dataset construction / OmniBehavior description] The manuscript states that OmniBehavior is 'constructed entirely from real-world data' and that systematic differences reveal LLM structural bias, but supplies no details on recruitment, logging completeness, demographic coverage, or handling of missing long-tail events. This is load-bearing for the central claim because the reported convergence to a positive average person and loss of individual differences could arise from selection or measurement biases in the reference traces rather than an intrinsic LLM property.
  2. [Evaluation and results] The abstract reports 'extensive evaluations' of LLMs, performance plateauing with context expansion, and a 'fundamental structural bias,' yet provides no metrics (e.g., behavioral divergence, accuracy on action sequences), statistical tests, data scale (number of users/traces), or controls. Without these, the evidence for both the simulation failures and the specific bias patterns (hyper-activity, homogenization, utopian bias) cannot be assessed for robustness.
minor comments (1)
  1. [Abstract / bias analysis] Clarify the precise definition of 'utopian bias' and 'positive average person' with concrete examples from the traces to avoid ambiguity in interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments in detail below and have prepared revisions to improve the clarity and completeness of the paper.

Point-by-point responses
  1. Referee: [Dataset construction / OmniBehavior description] The manuscript states that OmniBehavior is 'constructed entirely from real-world data' and that systematic differences reveal LLM structural bias, but supplies no details on recruitment, logging completeness, demographic coverage, or handling of missing long-tail events. This is load-bearing for the central claim because the reported convergence to a positive average person and loss of individual differences could arise from selection or measurement biases in the reference traces rather than an intrinsic LLM property.

    Authors: We agree that providing more details on dataset construction is important for validating our claims. In the revised version of the manuscript, we will expand the relevant section to include information on recruitment procedures, logging completeness, demographic coverage summaries, and methods for handling missing long-tail events. This will help demonstrate that the observed biases are not artifacts of the data collection process. We note that ethical and privacy considerations limit the extent of detail we can provide on individual participants. revision: yes

  2. Referee: [Evaluation and results] The abstract reports 'extensive evaluations' of LLMs, performance plateauing with context expansion, and a 'fundamental structural bias,' yet provides no metrics (e.g., behavioral divergence, accuracy on action sequences), statistical tests, data scale (number of users/traces), or controls. Without these, the evidence for both the simulation failures and the specific bias patterns (hyper-activity, homogenization, utopian bias) cannot be assessed for robustness.

    Authors: The full paper contains these metrics and details in the experiments and results sections. To address the concern about accessibility, we will update the abstract to briefly mention key quantitative findings and include a summary of the evaluation metrics, statistical tests, data scale, and controls in the main text or a new table. This revision will make the evidence more readily assessable while preserving the paper's structure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark rests on external real-world traces

full rationale

The paper introduces OmniBehavior as a benchmark constructed from real-world data and evaluates LLMs via direct comparisons of simulated versus authentic long-horizon behaviors. No equations, derivations, fitted parameters, or self-citations appear in the provided text as load-bearing elements for the central claims. The reported structural biases (hyper-activity, homogenization, utopian bias) are presented as outcomes of systematic differences against the external reference traces rather than reducing to any input by construction. This is a standard empirical setup self-contained against external benchmarks, consistent with the default non-circular finding for such papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest primarily on the domain assumption that the assembled real-world traces form a faithful, unbiased representation of holistic human behavior; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Real-world data can be integrated into a unified framework capturing long-horizon, cross-scenario, and heterogeneous behavioral patterns without tunnel vision.
    Invoked to justify why prior isolated-scenario datasets are insufficient and why the new benchmark reveals true LLM limitations.

pith-pipeline@v0.9.0 · 5785 in / 1357 out tokens · 66873 ms · 2026-05-22T10:26:50.614032+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel; Jcost_pos_of_ne_one echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors.

  • IndisputableMonolith/Cost.lean Jcost_unit0; Jcost_pos_of_ne_one echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Real human behavior is inherently sparse, with positive interaction rates remaining below 10%. By contrast, all evaluated LLM-based simulators exhibit a hyper-activity bias.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 11 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Introducing claude 4

    Anthropic. Introducing claude 4. https://www.anthropic.com/news/claude-4, May 2025

  3. [3]

    Introducing claude haiku 4.5

    Anthropic. Introducing claude haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, October 2025

  4. [4]

    Introducing claude opus 4.5

    Anthropic. Introducing claude opus 4.5. https://www.anthropic.com/news/claude-opus-4-5, November 2025

  5. [5]

    Introducing claude sonnet 4.5

    Anthropic. Introducing claude sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5, September 2025

  6. [6]

    Designing economic agents that act like human agents: A behavioral approach to bounded rationality. The American economic review, 81(2):353–359, 1991

    W Brian Arthur. Designing economic agents that act like human agents: A behavioral approach to bounded rationality. The American economic review, 81(2):353–359, 1991

  7. [7]

    The netflix prize

    James Bennett and Stan Lanning. The netflix prize. 2007

  8. [8]

    Simulations in recommender systems: An industry perspective. arXiv preprint arXiv:2109.06723, 2021

    Lucas Bernardi, Sakshi Batra, and Cintia Alicia Bruscantini. Simulations in recommender systems: An industry perspective. arXiv preprint arXiv:2109.06723, 2021

  9. [9]

    Simuser: Simulating user behavior with large language models for recommender system evaluation

    Nicolas Bougie and Narimawa Watanabe. Simuser: Simulating user behavior with large language models for recommender system evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 43–60, 2025

  10. [10]

    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023

  11. [11]

    Consistentchat: Building skeleton-guided consistent multi-turn dialogues for large language models from scratch

    Jiawei Chen, Xinyan Guan, Qianhao Yuan, Guozhao Mo, Weixiang Zhou, Yaojie Lu, Hongyu Lin, Ben He, Le Sun, and Xianpei Han. Consistentchat: Building skeleton-guided consistent multi-turn dialogues for large language models from scratch. In The 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  12. [12]

    Simulation of individual and group behavior. The American Economic Review, pages 920–932, 1960

    Geoffrey PE Clarkson and Herbert A Simon. Simulation of individual and group behavior. The American Economic Review, pages 920–932, 1960

  13. [13]

    A computational approach to politeness with application to social factors

    Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. A computational approach to politeness with application to social factors. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 250–259, 2013

  14. [14]

    Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1):1–24, 2024

    Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1):1–24, 2024

  15. [15]

    Kuairec: A fully-observed dataset and insights for evaluating recommender systems

    Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. Kuairec: A fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 540–550, 2022

  16. [16]

    Gemini 3 flash: frontier intelligence built for speed

    Google. Gemini 3 flash: frontier intelligence built for speed. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, December 2025

  17. [17]

    The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):1–19, 2015

    F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):1–19, 2015

  18. [18]

    SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

    Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, and Paul Röttger. Simbench: Benchmarking the ability of large language models to simulate human behaviors. arXiv preprint arXiv:2510.17516, 2025

  19. [19]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  20. [20]

    Recsim: A configurable simulation platform for recommender systems

    Eugene Ie, Chih-wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. Recsim: A configurable simulation platform for recommender systems. arXiv preprint arXiv:1909.04847, 2019

  21. [21]

    Surrealdriver: Designing llm-powered generative driver agent framework based on human drivers’ driving-thinking data,

    Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, and Jiangtao Gong. Surrealdriver: Designing generative driver agent simulation framework in urban contexts based on large language model. arXiv preprint arXiv:2309.13193, 5(7):8, 2023

  22. [22]

    Sim4ia-bench: A user simulation benchmark suite for next query and utterance prediction. arXiv preprint arXiv:2511.09329, 2025

    Andreas Konstantin Kruff, Christin Katharina Kreutz, Timo Breuer, Philipp Schaer, and Krisztian Balog. Sim4ia-bench: A user simulation benchmark suite for next query and utterance prediction. arXiv preprint arXiv:2511.09329, 2025

  23. [23]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  24. [24]

    Artificial intelligence and simulation: An introduction

    R Greer Lavery. Artificial intelligence and simulation: An introduction. In Proceedings of the 18th conference on Winter simulation, pages 448–452, 1986

  25. [25]

    Field theory in social science: selected theoretical papers (edited by dorwin cartwright.)

    Kurt Lewin. Field theory in social science: selected theoretical papers (edited by dorwin cartwright.). 1951

  26. [26]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  27. [27]

    Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics, 12:157–173, 2024

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics, 12:157–173, 2024

  28. [28]

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753, 2024

  29. [29]

    Langchain

    Vasilios Mavroudis. Langchain. 2024

  30. [30]

    The place of modeling in cognitive science. Topics in Cognitive Science, 1(1):11–38, 2009

    James L McClelland. The place of modeling in cognitive science. Topics in Cognitive Science, 1(1):11–38, 2009

  31. [31]

    Some methods of classification and analysis of multivariate observations

    James B McQueen. Some methods of classification and analysis of multivariate observations. In Proc. of 5th Berkeley Symposium on Math. Stat. and Prob., pages 281–297, 1967

  32. [32]

    Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation

    X Mou, Z Wei, and X Huang. Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation. arXiv preprint arXiv:2402.16333, 2024

  33. [33]

    Cognitive science and artificial intelligence: simulating the human mind and its complexity. Cognitive Computation and Systems, 1(4):113–116, 2019

    Mohd Naveed Uddin. Cognitive science and artificial intelligence: simulating the human mind and its complexity. Cognitive Computation and Systems, 1(4):113–116, 2019

  34. [34]

    OpenAI. Gpt-5.2. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-2/, 2025

  35. [35]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  36. [36]

    Artificial intelligence and virtual worlds–toward human-level ai agents

    Vladimir M Petrović. Artificial intelligence and virtual worlds–toward human-level ai agents. IEEE Access, 6:39976–39988, 2018

  37. [37]

    Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction

    Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 2685–2692, 2020

  38. [38]

    Computational politeness in natural language processing: A survey. ACM Computing Surveys, 56(9):1–42, 2024

    Priyanshu Priya, Mauajama Firdaus, and Asif Ekbal. Computational politeness in natural language processing: A survey. ACM Computing Surveys, 56(9):1–42, 2024

  39. [39]

    KuaiLive: A Real-time Interactive Dataset for Live Streaming Recommendation

    Changle Qu, Sunhao Dai, Ke Guo, Liqin Zhao, Yanan Niu, Xiao Zhang, and Jun Xu. Kuailive: A real-time interactive dataset for live streaming recommendation. arXiv preprint arXiv:2508.05633, 2025

  40. [40]

    A plea for (good) simulations: nudging economics toward an experimental science

    Julian Reiss. A plea for (good) simulations: nudging economics toward an experimental science. Simulation & gaming, 42(2):243–264, 2011

  41. [41]

    Bases: Large-scale web search user simulation with large language model based agents

    Ruiyang Ren, Peng Qiu, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Bases: Large-scale web search user simulation with large language model based agents. arXiv preprint arXiv:2402.17505, 2024

  42. [42]

    Towards Understanding Sycophancy in Language Models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023

  43. [43]

    Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning

    Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4902–4909, 2019

  44. [44]

    Integrating llm in agent-based social simulation: Opportunities and challenges

    Patrick Taillandier, Jean Daniel Zucker, Arnaud Grignard, Benoit Gaudou, Nghi Quang Huynh, and Alexis Drogoul. Integrating llm in agent-based social simulation: Opportunities and challenges. arXiv preprint arXiv:2507.19364, 2025

  45. [45]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  46. [46]

    User behavior simulation with large language model based agents

    Lei Wang, Jingsen Zhang, Hao Yang, Zhiyuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, et al. When large language model based agent meets user behavior analysis: A novel user simulation paradigm. arXiv preprint arXiv:2306.02552, 2023

  47. [47]

    Characterbox: Evaluating the role-playing capabilities of llms in text-based virtual worlds

    Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haoxuan Li, Xu Chen, Xing Xie, and Ji-Rong Wen. Characterbox: Evaluating the role-playing capabilities of llms in text-based virtual worlds. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long P...

  48. [48]

    Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models

    Noah Wang, Zy Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, et al. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14743–14777, 2024

  49. [49]

    A survey on llm-based agents for social simulation: Taxonomy, evaluation and applications

    Zixu Wang, Bin Xie, Bingbing Xu, Shengmao Zhu, Yige Yuan, Liang Pang, Long Yang Du Su, Zixuan Li, Huawei Shen, and Xueqi Cheng. A survey on llm-based agents for social simulation: Taxonomy, evaluation and applications

  50. [50]

    Human simulacra: Benchmarking the personification of large language models

    Qiuejie Xie, Qiming Feng, Tianqi Zhang, Qingqiu Li, Linyi Yang, Yuejie Zhang, Rui Feng, Liang He, Shang Gao, and Yue Zhang. Human simulacra: Benchmarking the personification of large language models. arXiv preprint arXiv:2402.18180, 2024

  51. [51]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  52. [52]

    An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383, 2025

  53. [53]

    Evaluating large language models as generative user simulators for conversational recommendation

    Se-eun Yoon, Zhankui He, Jessica Echterhoff, and Julian McAuley. Evaluating large language models as generative user simulators for conversational recommendation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1490–1504, 2024

  54. [54]

    Tenrec: A large-scale multipurpose benchmark dataset for recommender systems

    Guanghu Yuan, Fajie Yuan, Yudong Li, Beibei Kong, Shujie Li, Lei Chen, Min Yang, Chenyun Yu, Bo Hu, Zang Li, et al. Tenrec: A large-scale multipurpose benchmark dataset for recommender systems. Advances in Neural Information Processing Systems, 35:11480–11493, 2022

  55. [55]

    Glm-4.7: Advancing the coding capability

    Z.ai. Glm-4.7: Advancing the coding capability. https://z.ai/blog/glm-4.7, December 2025

  56. [56]

    Agentcf: Collaborative learning with autonomous language agents for recommender systems

    Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. Agentcf: Collaborative learning with autonomous language agents for recommender systems. In Proceedings of the ACM Web Conference 2024, pages 3679–3689, 2024

  57. [57]

    Ai-salesman: Towards reliable large language model driven telemarketing

    Qingyu Zhang, Chunlei Xin, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Qing Ye, Qianlong Xie, and Xingxing Wang. Ai-salesman: Towards reliable large language model driven telemarketing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34790–34798, 2026

  58. [58]

    Evaluating conversational recommender systems via user simulation

    Shuo Zhang and Krisztian Balog. Evaluating conversational recommender systems via user simulation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1512–1520, 2020

  59. [59]

    The jddc 2.0 corpus: A large-scale multimodal multi-turn chinese dialogue dataset for e-commerce customer service

    Nan Zhao, Haoran Li, Youzheng Wu, Xiaodong He, and Bowen Zhou. The jddc 2.0 corpus: A large-scale multimodal multi-turn chinese dialogue dataset for e-commerce customer service. arXiv preprint arXiv:2109.12913, 2021

  60. [60]

    Deep interest network for click-through rate prediction

    Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1059–1068, 2018

    A Data Statistics

    A.1 Action Sequence Length Distribution

    F...

  61. [61]

    Live streaming type: What type of live streaming is this (E-commerce / gaming / chatting / talent performance, etc.)
    2. Host characteristics: The host’s basic appearance features
    3. Image text: Extract key text from the cover (**Note: Only extract core text such as live streaming title, product names, prices, promotional information, etc. Do NOT extract b...

  62. [62]

    One concise category (Category), such as “Beauty,” “Games,” “News,” etc

  63. [63]

    Three specific keywords (Keywords). Ignore the interactive form of the text. Even if it is casual chat between friends, look beyond the social surface and identify the underlying topic being discussed.
    Content: "text"
    Output Format: Return only a JSON object containing two fields: "category" and "keywords" (a list of strings).
    Example: "category": "Techno...

  64. [64]

    Noise Removal:
    * Remove meaningless garbled characters (e.g., AC:BU526, IC·BQ528, within 50 meters, and other interfering information).
    * Filter excessively redundant filler words, such as repeated occurrences of “uh,” “ah,” “that is to say,” retaining only those necessary for context

  65. [65]

    Semantic Correction:
    * Correct obvious recognition errors (e.g., change “cumin cowhide” to “naturally revealed,” or infer based on context; if the correct meaning cannot be determined, keep the original).
    * Complete broken sentences and add commas, periods, or question marks appropriately based on tone and emphasis.
    4. Formatting Standards:
    * Unify full-w...