Recognition: 2 theorem links
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3
The pith
Large language models simulating human behavior converge to a positive average person and erase individual differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the OmniBehavior benchmark, which is built from real-world long-horizon, cross-scenario, and heterogeneous behavioral traces, state-of-the-art LLMs struggle to simulate behavior accurately even as context length grows. Direct comparison to authentic data reveals a structural bias: models converge toward a positive average person, producing hyper-activity, persona homogenization, and a Utopian bias that together erase individual differences and long-tail behaviors.
What carries the argument
The OmniBehavior benchmark, which unifies long-horizon, cross-scenario, and heterogeneous real-world behavioral traces into a single evaluation framework for comparing simulated outputs against authentic decision sequences.
If this is right
- Prior isolated-scenario benchmarks produce tunnel vision that does not reflect how real decisions form across linked scenarios over time.
- LLM simulation performance plateaus rather than improving once context windows are enlarged.
- High-fidelity user simulation will require targeted methods to restore individual differences and long-tail behaviors.
- Applications relying on behavioral simulation inherit the same convergence to averaged positive patterns.
Where Pith is reading between the lines
- The bias may stem from training data that over-represents normative or desirable outcomes, suggesting targeted data augmentation with diverse real traces as one remedy.
- Downstream tasks such as agent-based social modeling or predictive user interfaces would systematically under-represent risk-taking or atypical choices.
- Repeating the comparison on non-LLM simulators or on future model families could isolate whether the convergence is specific to current autoregressive architectures.
Load-bearing premise
The collected real-world behavioral traces represent holistic human decision-making without meaningful collection or annotation biases, and the metrics for activity, homogenization, and positivity accurately reflect model limitations rather than benchmark artifacts.
What would settle it
Re-running the same LLM simulations on the benchmark traces and finding that activity levels, persona variance, and outcome positivity distributions match the real data within measurement error.
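The settling test above amounts to a distribution comparison: do simulated activity levels, persona variance, and positivity match the real data within sampling error? A minimal sketch of one plausible check, using synthetic data and a two-sample Kolmogorov-Smirnov statistic (the paper does not specify which test it would use; all data below is invented):

```python
# Hypothetical sketch: checking whether a simulated behavior distribution
# matches the real one within sampling error, via a two-sample
# Kolmogorov-Smirnov statistic. Synthetic data only, not from the paper.
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:
        fa = sum(x <= v for x in a) / len(a)
        fb = sum(x <= v for x in b) / len(b)
        d = max(d, abs(fa - fb))
    return d

random.seed(0)
n = 500
real = [random.gauss(0.0, 1.0) for _ in range(n)]          # stand-in for a real metric, e.g. activity level
sim_matched = [random.gauss(0.0, 1.0) for _ in range(n)]   # a simulator with no systematic shift
sim_positive = [random.gauss(0.8, 1.0) for _ in range(n)]  # a "Utopian" simulator shifted toward positivity

# Approximate 5% critical value for two equal samples of size n
crit = 1.36 * math.sqrt(2 / n)

d_matched = ks_statistic(real, sim_matched)
d_positive = ks_statistic(real, sim_positive)
```

On this toy data the shifted simulator's statistic exceeds the critical value while the matched one's typically does not, which is exactly the "within measurement error" criterion the pith describes.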
Original abstract
The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniBehavior, the first benchmark for LLM-based user simulation constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral traces. It argues that prior benchmarks suffer from tunnel vision due to isolated scenarios, provides empirical evidence that LLMs struggle to simulate authentic complex behaviors (with performance plateauing despite larger context windows), and identifies a structural bias wherein LLMs converge toward a 'positive average person' via hyper-activity, persona homogenization, and Utopian bias, resulting in loss of individual differences and long-tail behaviors.
Significance. If the central claims hold after addressing data and metric robustness, the work would be significant for establishing a more realistic evaluation framework for human behavior simulation and for documenting concrete limitations in current LLMs' ability to capture behavioral heterogeneity. This could usefully direct research toward better fidelity in agent and user modeling. The contribution is tempered by the absence of detailed validation for the real traces and bias metrics, which directly affects how much weight the structural-bias conclusion can carry.
Major comments (2)
- [§3] §3 (Benchmark Construction): The manuscript provides no details on data collection protocols, sample sizes, participant selection criteria, annotation procedures, or statistical methods used to build the real-world traces. This is load-bearing for the central claim, as the comparison of LLM outputs to 'authentic' behaviors and the identification of structural biases presuppose that the collected traces are an unbiased, representative sample of heterogeneous long-horizon decision-making.
- [§5] §5 (Bias Analysis): The quantitative definitions and measurement procedures for hyper-activity, persona homogenization, and Utopian bias are not formalized (no equations or explicit aggregation rules are given). The reported convergence to a 'positive average person' appears to use global means without per-person baselines or robustness checks against alternative labeling/aggregation choices; this leaves open the possibility that the observed gaps are benchmark-construction artifacts rather than intrinsic LLM properties, directly undermining the 'fundamental structural bias' conclusion.
Minor comments (2)
- [Abstract] The abstract states that 'performance plateauing' occurs as context windows expand but does not report the specific context lengths tested, the exact performance metrics (e.g., action prediction accuracy, sequence similarity), or the statistical significance of the plateau.
- [Figures/Tables] Figure and table captions could more explicitly link visual results to the three named biases (hyper-activity, homogenization, Utopian bias) to improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments identify areas where additional clarity and formalization will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns about data provenance and metric definitions.
Point-by-point responses
Referee: [§3] §3 (Benchmark Construction): The manuscript provides no details on data collection protocols, sample sizes, participant selection criteria, annotation procedures, or statistical methods used to build the real-world traces. This is load-bearing for the central claim, as the comparison of LLM outputs to 'authentic' behaviors and the identification of structural biases presuppose that the collected traces are an unbiased, representative sample of heterogeneous long-horizon decision-making.
Authors: We agree that Section 3 would benefit from expanded documentation of the data pipeline. In the revised manuscript we will add a dedicated subsection titled 'Trace Collection and Validation' that specifies: (i) participant recruitment channels and inclusion/exclusion criteria, (ii) exact sample sizes (number of users, total trace hours, and cross-scenario coverage), (iii) logging protocols and consent procedures, (iv) any post-collection annotation or scenario labeling steps, and (v) statistical checks performed to assess representativeness and heterogeneity. These additions will allow readers to evaluate the degree to which the traces support the authenticity claims. revision: yes
Referee: [§5] §5 (Bias Analysis): The quantitative definitions and measurement procedures for hyper-activity, persona homogenization, and Utopian bias are not formalized (no equations or explicit aggregation rules are given). The reported convergence to a 'positive average person' appears to use global means without per-person baselines or robustness checks against alternative labeling/aggregation choices; this leaves open the possibility that the observed gaps are benchmark-construction artifacts rather than intrinsic LLM properties, directly undermining the 'fundamental structural bias' conclusion.
Authors: We accept that the current presentation of the bias metrics lacks sufficient formalization. The revision will introduce explicit equations and aggregation rules for each bias: hyper-activity will be defined as the per-user deviation in action rate relative to the corresponding real trace; persona homogenization will be quantified via the reduction in behavioral embedding variance across simulated versus real individuals; and Utopian bias will be measured by a positivity score derived from outcome sentiment. In addition, we will report per-person baseline comparisons (rather than solely global means) and include robustness analyses that vary labeling granularity and aggregation functions (mean vs. median, different embedding models). These changes will directly test whether the observed convergence is robust or an artifact of the chosen metrics. revision: partial
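The metric formalizations promised in this response can be sketched in a few lines. This is an illustrative reading of the rebuttal's wording, not the authors' implementation; all function names and numbers are invented for the example:

```python
# Illustrative sketch (not the authors' code) of the three bias metrics as
# paraphrased in the rebuttal: hyper-activity as per-user action-rate
# deviation, homogenization as cross-person variance reduction, and Utopian
# bias as a shift in mean outcome positivity. All numbers are made up.
import statistics

def hyper_activity(sim_rates, real_rates):
    """Mean per-user deviation of simulated action rate from the real trace."""
    return statistics.mean(s - r for s, r in zip(sim_rates, real_rates))

def homogenization(sim_scores, real_scores):
    """1 - var(sim)/var(real) over per-person behavior scores; positive
    values mean simulated personas vary less than real people do."""
    return 1 - statistics.pvariance(sim_scores) / statistics.pvariance(real_scores)

def utopian_bias(sim_sentiment, real_sentiment):
    """Shift in mean outcome positivity (sentiment scored in [-1, 1])."""
    return statistics.mean(sim_sentiment) - statistics.mean(real_sentiment)

# Toy example: four simulated users compared against their real traces
ha = hyper_activity([4.0, 6.0, 3.0, 7.0], [2.0, 5.0, 1.0, 8.0])     # simulator over-acts on average
hom = homogenization([0.45, 0.55, 0.5, 0.5], [0.1, 0.9, 0.3, 0.7])  # simulated personas collapse together
ub = utopian_bias([0.6, 0.7, 0.5, 0.8], [0.1, -0.2, 0.3, 0.0])      # simulated outcomes skew positive
```

Note that `hyper_activity` is computed per user before averaging, which is the per-person-baseline point the referee raises: a global mean over pooled actions could hide offsetting individual deviations.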
Circularity Check
No circularity: claims rest on direct empirical comparison to independent real-world traces
Full rationale
The paper constructs OmniBehavior from real-world data and evaluates LLMs via direct comparison of simulated outputs against held-out authentic behavioral traces. The reported structural biases (hyper-activity, persona homogenization, Utopian bias) are presented as outcomes of this external comparison rather than quantities derived by fitting parameters, self-defining metrics, or reducing via self-citation chains within the study. No equations, ansatzes, or uniqueness theorems are invoked that collapse the central claims back to the paper's own inputs by construction. The derivation chain remains self-contained against the benchmark data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Real-world behavioral traces can be assembled into a unified benchmark that faithfully captures long-horizon, cross-scenario heterogeneity without significant selection or annotation artifacts.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "current LLMs exhibit a substantial capability gap in modeling real-world user behaviors, regardless of context length"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.