Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Boxi Cao; Hongyu Lin; Jiawei Chen; Le Sun; Ruotong Pan; Ruoxi Xu; Tingting Gao; Xiangyu Wu; Xianpei Han; Yaojie Lu

arxiv: 2604.08362 · v2 · pith:ZSEJRDVMnew · submitted 2026-04-09 · 💻 cs.CL · cs.AI · cs.LG

Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin

Pith reviewed 2026-05-22 10:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: LLM user simulation · behavior benchmark · structural bias · persona homogenization · long-horizon traces · real-world data · utopian bias

The pith

LLMs simulating real human behavior converge toward a positive average person and erase individual differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds OmniBehavior, a benchmark drawn entirely from real-world traces, to test how well large language models can act as user simulators across long sequences that cross multiple life scenarios. It first shows that prior benchmarks using isolated or synthetic settings miss the causal chains that link decisions over time in actual human lives. When state-of-the-art models are evaluated on the new benchmark, they produce behaviors that are more active, more uniform across people, and more optimistic than the source data. The resulting structural bias removes the variability and infrequent patterns that define real individuals, limiting how faithfully any downstream application can replay or predict human actions.

Core claim

A systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors.

What carries the argument

The OmniBehavior benchmark, which assembles long-horizon, cross-scenario, and heterogeneous behavioral patterns directly from real-world data to serve as ground truth for simulation fidelity.

If this is right

Isolated-scenario datasets create tunnel vision that hides the cross-scenario causal chains present in real decision-making.
LLM simulation performance plateaus even when context windows are enlarged.
The structural bias produces outputs that systematically omit the low-frequency behaviors observed in authentic traces.
High-fidelity simulation will require explicit mechanisms to preserve individual differences rather than defaulting to an averaged persona.
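
One way to make "omits low-frequency behaviors" measurable is a long-tail coverage score: the share of rare action types in the real trace that the simulator produces at all. The sketch below is illustrative only. The metric, the `rare_share` threshold, and the action names are assumptions introduced here, not definitions from the paper.

```python
from collections import Counter

def long_tail_coverage(real_actions, simulated_actions, rare_share=0.05):
    """Fraction of rare real action types that the simulator also emits.

    An action type counts as rare when it makes up at most `rare_share`
    of the real trace. Both the threshold and this metric are illustrative
    assumptions, not definitions from the paper.
    """
    real_counts = Counter(real_actions)
    total = sum(real_counts.values())
    rare = {a for a, n in real_counts.items() if n / total <= rare_share}
    if not rare:
        return 1.0  # nothing rare to cover
    simulated_types = set(simulated_actions)
    return len(rare & simulated_types) / len(rare)

# Toy traces (made up): the simulator never complains or requests a refund.
real = ["skip"] * 90 + ["like"] * 6 + ["complain"] * 2 + ["refund"] * 2
simulated = ["like"] * 60 + ["skip"] * 40

coverage = long_tail_coverage(real, simulated)
print(f"long-tail coverage: {coverage:.2f}")
```

A score near 1.0 would mean the simulator at least touches every rare action type; it says nothing about whether the frequencies match, which would need a separate distributional comparison.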

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same convergence may appear in other generative tasks that rely on modeling user preferences or sequences, such as personalized recommendation or dialogue systems.
A practical extension would be to add explicit regularization or retrieval steps that force models to reproduce measured frequencies of rare actions from the source traces.
Testing whether the bias persists when models are given explicit negative or low-activity examples from the same data would clarify whether the issue is data scarcity or architectural.
If the homogenization is confirmed across multiple languages or cultures, it would indicate a training-data skew rather than a language-specific artifact.

Load-bearing premise

The real-world behavioral traces collected for OmniBehavior accurately and representatively capture authentic long-horizon, cross-scenario human decision-making without significant selection or measurement biases.

What would settle it

Collect a fresh set of long-horizon traces from a demographically different population that explicitly includes documented long-tail decisions, then measure whether LLM outputs still flatten those decisions into positive averages.

Figures

Figures reproduced from arXiv: 2604.08362 by Boxi Cao, Hongyu Lin, Jiawei Chen, Le Sun, Ruotong Pan, Ruoxi Xu, Tingting Gao, Xiangyu Wu, Xianpei Han, Yaojie Lu, Yifei Hu, Yingfei Sun, Yong Du, Yunfei Zhang.

**Figure 1.** Overview of OmniBehavior, a real-world comprehensive benchmark for evaluating LLM-based user simulators. The benchmark is constructed in three stages: (1) Data Collection: aggregation of real-world logs from the Kuaishou platform across several major scenarios, with Customer Service treated as part of the E-commerce scenario, yielding five scenarios in total after aggregation. (2) Data Processing: multi-mo…

**Figure 2.** User profile reconstruction based on single-scenario vs. multi-scenario data.

**Figure 3.** Cumulative interest coverage with increasing scenarios. Determining how to construct a dataset that fully captures the decision-making process requires a deep understanding of user causal chains. To this end, we sample 180 high-value conversion events (e.g., “Purchase”) and trace the full historical interaction paths leading to the final outcome using Claude Sonnet-4.5 [5]. All traced paths are further manua…

**Figure 4.** Distributions of causal chain spans. (a) …

**Figure 5.** Case study of a cross-scenario causal chain, in which a search-initiated interest in “Xiaomi” …

**Figure 6.** Interests in our real dataset (OmniBehavior) evolve smoothly, while interests in the synthetic dataset (LoCoMo) show rigid, task-driven spikes. To investigate whether synthetic user simulation data can accurately reproduce the complex dynamics of real user interest evolution, we conduct a comparative analysis between authentic and synthetic user trajectories. Specifically, we compare OmniBehavior with LoCoMo [28…

**Figure 7.** Effect of Context Window Size. To evaluate LLMs’ capability to model long interaction histories, we test representative open-source and closed-source LLMs across varying context window sizes from 16K to 128K tokens. The experiments are conducted on a specific user subset (N = 66) with interaction histories exceeding 128K tokens to ensure sufficient data for scaling analysis. As shown in …

**Figure 8.** Effect of Memory Management. To further examine whether commonly used context management mechanisms can alleviate the above limitations, we compare the performance of two representative memory management approaches based on Qwen3-235B. Both approaches operate on the full user interaction history. For the summarization-based method, we periodically summarize the history whenever it reaches a 4k-token buffer. …
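
The summarization-based approach in the Figure 8 caption (fold the history into a summary each time a buffer fills) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the whitespace token counter, the `summarize` hook, and the toy summarizer are hypothetical stand-ins, and the excerpt does not say how the authors implement any of these steps. In practice `summarize` would be an LLM call and `buffer_limit` would be around 4,000 tokens.

```python
from typing import Callable, Iterable, List, Tuple

def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer is assumed in practice."""
    return len(text.split())

def rolling_summary(
    events: Iterable[str],
    summarize: Callable[[str, List[str]], str],
    buffer_limit: int = 4000,
) -> Tuple[str, List[str]]:
    """Fold a long history into (summary, remaining_buffer).

    Whenever the buffered events reach `buffer_limit` tokens, they are
    merged into the running summary via `summarize` and the buffer is
    cleared. `summarize` is a hypothetical hook (an LLM call in practice).
    """
    summary: str = ""
    buffer: List[str] = []
    used = 0
    for event in events:
        buffer.append(event)
        used += count_tokens(event)
        if used >= buffer_limit:
            summary = summarize(summary, buffer)
            buffer, used = [], 0
    return summary, buffer

def toy_summarize(previous: str, chunk: List[str]) -> str:
    """Placeholder summarizer: records only how many events were folded in."""
    return f"{previous}[{len(chunk)} events]"

# Toy history (made up); each event is four whitespace tokens long.
history = [f"user watched video {i}" for i in range(10)]
summary, tail = rolling_summary(history, toy_summarize, buffer_limit=12)
print(summary, len(tail))
```

The simulator prompt would then be built from `summary` plus the unsummarized `tail`, so the most recent events stay verbatim while older ones are compressed.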

**Figure 9.** Comparison of positive interaction rates between real users and LLM-based simulators across scenarios. LLM-generated behaviors show substantially higher positive rates, revealing a systematic hyper-activity bias. We first compare real and simulated user behaviors at the distribution level by measuring the positive prediction rate, defined as the proportion of positive outcomes among all interactions. We …
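
The positive prediction rate defined in the Figure 9 caption (positive outcomes divided by all interactions) can be computed per scenario as in the sketch below. The record layout, the scenario name, and the numbers are invented for illustration and are not taken from OmniBehavior.

```python
from collections import defaultdict

def positive_rate_by_scenario(records):
    """Return {scenario: share of interactions with a positive outcome}.

    `records` is an iterable of (scenario, is_positive) pairs; this layout
    is a hypothetical stand-in for the benchmark's actual log format.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for scenario, is_positive in records:
        totals[scenario] += 1
        positives[scenario] += int(bool(is_positive))
    return {s: positives[s] / totals[s] for s in totals}

# Toy data (made up): real logs are sparse, simulated logs are over-active.
real = [("video", False)] * 19 + [("video", True)]
simulated = [("video", False)] * 10 + [("video", True)] * 10

real_rate = positive_rate_by_scenario(real)["video"]
sim_rate = positive_rate_by_scenario(simulated)["video"]
print(f"real={real_rate:.2f} simulated={sim_rate:.2f} gap={sim_rate - real_rate:+.2f}")
```

A consistently positive gap across scenarios is the pattern the caption calls hyper-activity bias.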

**Figure 10.** Sentiment distribution of real users and …

**Figure 11.** Language style comparison between real users and LLM-simulated users. LLM-generated utterances exhibit higher levels of politeness markers, hedging, and face-saving strategies, indicating a systematic tendency towards overly polite and non-confrontational language. Language Style. Complementary analysis of language style further reveals systematic differences in how users express dissatisfaction. Follow…

**Figure 12.** Comparison of intra-user and inter-user behavioral distances for human and LLM-simulated users. Real users exhibit significantly larger inter-user variation than intra-user variation, whereas LLM-generated users show heavily overlapping distributions, indicating a pronounced tendency toward persona homogenization. The above findings raise the question of whether LLM-based simulators preserve personalize…
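
The intra-user versus inter-user comparison in the Figure 12 caption can be illustrated with a small sketch. It assumes each user is represented by several equal-length behavior vectors (for example, per-session action frequencies) and uses Euclidean distance. Both choices are assumptions made here, since the excerpt does not state the features or the distance the authors use.

```python
from itertools import combinations
from math import dist
from statistics import mean

def intra_inter_distances(users):
    """Split pairwise distances into same-user and cross-user groups.

    `users` maps a user id to a list of equal-length behavior vectors
    (hypothetical per-session features). Euclidean distance is an
    assumption; the source excerpt does not name the metric.
    """
    labelled = [(uid, vec) for uid, vecs in users.items() for vec in vecs]
    intra, inter = [], []
    for (uid_a, vec_a), (uid_b, vec_b) in combinations(labelled, 2):
        (intra if uid_a == uid_b else inter).append(dist(vec_a, vec_b))
    return intra, inter

def separation(users):
    """Mean inter-user distance minus mean intra-user distance."""
    intra, inter = intra_inter_distances(users)
    return mean(inter) - mean(intra)

# Toy vectors (made up): distinct humans vs. near-identical simulated personas.
human = {"u1": [(0.9, 0.1), (0.8, 0.2)], "u2": [(0.1, 0.9), (0.2, 0.8)]}
simulated = {"u1": [(0.5, 0.5), (0.6, 0.4)], "u2": [(0.5, 0.5), (0.4, 0.6)]}

print(f"human separation:     {separation(human):.3f}")
print(f"simulated separation: {separation(simulated):.3f}")
```

A separation close to zero means two sessions from different users are about as similar as two sessions from the same user, which is the overlap the caption describes as persona homogenization.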

**Figure 13.** Log-scaled distribution of user action sequence lengths, spanning hundreds to over 100k, requiring models to handle ultra-long contexts. To provide a clearer picture of OmniBehavior’s composition, we report additional statistics on user behavior sequences and population attributes …

**Figure 14.** Demographic and behavioral distributions of users in the benchmark. The charts show …

**Figure 15.** The OmniBehavior Benchmark Scope. We construct a unified simulation environment …

**Figure 16.** Vocabulary comparison of real users (left) and LLM-based simulators (right). The LLM …

**Figure 17.** Intra-user vs. inter-user behavioral distance distributions for all evaluated models.

**Figure 18.** Prompt for binary value prediction in the main experiment.

**Figure 19.** Prompt for extracting and understanding key information from live streaming cover …

**Figure 20.** Prompt for continuous value prediction in the main experiment.

**Figure 21.** Prompt for sentiment classification in the Utopian Tendency experiment.

**Figure 22.** Prompt for text value prediction in the main experiment.

**Figure 23.** Prompt for identifying an item’s interest categories and keywords.

**Figure 24.** Prompt for raw data cleaning. You are a user behavior analysis expert focused on causal inference. Your task is to analyze a user’s historical interaction sequence, identify the key causal events that lead to a target behavior, and explain their roles. Input data includes the user history sequence and the target behavior. Task description: Review the entire history and identify key events that meaningfull…

**Figure 25.** Prompt for causal chain identification.

**Figure 26.** Prompt for language style comparison. E. Case Study. To facilitate an intuitive understanding of our evaluation pipeline, we provide qualitative examples across multiple representative application scenarios. Specifically, we select four real-world settings: behavior prediction in live-streaming ( …

Original abstract

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OmniBehavior gives a real-data benchmark for long-horizon LLM simulation and flags a convergence bias, but the data collection details will decide if the bias claim holds.

The paper's core contribution is a benchmark built from actual user traces that span multiple scenarios and long time horizons, then used to test how well current LLMs can replay those traces. It reports that the models settle on an overly positive, high-activity average persona and lose the individual variation and rare behaviors present in the logs. That pattern is presented as a structural limitation rather than a fixable prompt issue.

The shift away from synthetic or single-scenario data is the clearest advance; it lets them demonstrate that isolated benchmarks miss the causal chains across contexts that real decisions involve. The plateau in performance with larger context windows is also shown directly against the real traces. Those comparisons are the parts that could matter for people trying to build agent simulators or user models.

The main uncertainty sits with the ground-truth data. The abstract claims the traces are entirely real-world, but without specifics on how participants were recruited, how complete the logging was across quiet periods, or how demographic coverage was checked, it is hard to separate LLM shortcomings from possible skew in the reference set. If the logs over-sample active or digitally visible users, the reported homogenization and utopian tilt could partly reflect that mismatch instead of an intrinsic model property. The paper would benefit from explicit controls or sensitivity checks on the trace collection.

This work is aimed at researchers building or evaluating behavior simulators for HCI, virtual environments, or personalized systems. Anyone already running LLM-based agents will find the empirical gaps useful to see. It is coherent enough on its own terms to warrant referee time, even though the data-quality questions will likely require revision. I would send it out for review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces OmniBehavior, a benchmark for LLM-based user simulation constructed entirely from real-world long-horizon, cross-scenario, and heterogeneous behavioral traces. It argues that existing isolated-scenario datasets suffer from tunnel vision compared to real-world causal chains, shows that state-of-the-art LLMs struggle to simulate these behaviors with performance plateauing despite larger context windows, and identifies a structural bias in LLMs toward a positive average person, manifested as hyper-activity, persona homogenization, and utopian bias that erases individual differences and long-tail behaviors.

Significance. If the empirical comparisons and bias findings hold after rigorous validation of the ground-truth data, this work would be significant for advancing user simulation research in NLP and HCI. It provides the first unified real-world benchmark beyond synthetic or narrow scenarios and surfaces concrete failure modes (homogenization, loss of long-tail events) that could guide mitigation strategies in generative behavior modeling.

major comments (2)

[Dataset construction / OmniBehavior description] The manuscript states that OmniBehavior is 'constructed entirely from real-world data' and that systematic differences reveal LLM structural bias, but supplies no details on recruitment, logging completeness, demographic coverage, or handling of missing long-tail events. This is load-bearing for the central claim because the reported convergence to a positive average person and loss of individual differences could arise from selection or measurement biases in the reference traces rather than an intrinsic LLM property.
[Evaluation and results] The abstract reports 'extensive evaluations' of LLMs, performance plateauing with context expansion, and a 'fundamental structural bias,' yet provides no metrics (e.g., behavioral divergence, accuracy on action sequences), statistical tests, data scale (number of users/traces), or controls. Without these, the evidence for both the simulation failures and the specific bias patterns (hyper-activity, homogenization, utopian bias) cannot be assessed for robustness.

minor comments (1)

[Abstract / bias analysis] Clarify the precise definition of 'utopian bias' and 'positive average person' with concrete examples from the traces to avoid ambiguity in interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments in detail below and have prepared revisions to improve the clarity and completeness of the paper.

Referee: [Dataset construction / OmniBehavior description] The manuscript states that OmniBehavior is 'constructed entirely from real-world data' and that systematic differences reveal LLM structural bias, but supplies no details on recruitment, logging completeness, demographic coverage, or handling of missing long-tail events. This is load-bearing for the central claim because the reported convergence to a positive average person and loss of individual differences could arise from selection or measurement biases in the reference traces rather than an intrinsic LLM property.

Authors: We agree that providing more details on dataset construction is important for validating our claims. In the revised version of the manuscript, we will expand the relevant section to include information on recruitment procedures, logging completeness, demographic coverage summaries, and methods for handling missing long-tail events. This will help demonstrate that the observed biases are not artifacts of data collection biases. We note that ethical and privacy considerations limit the extent of detail we can provide on individual participants.
Revision: yes
Referee: [Evaluation and results] The abstract reports 'extensive evaluations' of LLMs, performance plateauing with context expansion, and a 'fundamental structural bias,' yet provides no metrics (e.g., behavioral divergence, accuracy on action sequences), statistical tests, data scale (number of users/traces), or controls. Without these, the evidence for both the simulation failures and the specific bias patterns (hyper-activity, homogenization, utopian bias) cannot be assessed for robustness.

Authors: The full paper contains these metrics and details in the experiments and results sections. To address the concern about accessibility, we will update the abstract to briefly mention key quantitative findings and include a summary of the evaluation metrics, statistical tests, data scale, and controls in the main text or a new table. This revision will make the evidence more readily assessable while preserving the paper's structure.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark rests on external real-world traces

The paper introduces OmniBehavior as a benchmark constructed from real-world data and evaluates LLMs via direct comparisons of simulated versus authentic long-horizon behaviors. No equations, derivations, fitted parameters, or self-citations appear in the provided text as load-bearing elements for the central claims. The reported structural biases (hyper-activity, homogenization, utopian bias) are presented as outcomes of systematic differences against the external reference traces rather than reducing to any input by construction. This is a standard empirical setup self-contained against external benchmarks, consistent with the default non-circular finding for such papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest primarily on the domain assumption that the assembled real-world traces form a faithful, unbiased representation of holistic human behavior; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Real-world data can be integrated into a unified framework capturing long-horizon, cross-scenario, and heterogeneous behavioral patterns without tunnel vision.
Invoked to justify why prior isolated-scenario datasets are insufficient and why the new benchmark reveals true LLM limitations.

pith-pipeline@v0.9.0 · 5785 in / 1357 out tokens · 66873 ms · 2026-05-22T10:26:50.614032+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel; Jcost_pos_of_ne_one echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors.
IndisputableMonolith/Cost.lean Jcost_unit0; Jcost_pos_of_ne_one echoes

Real human behavior is inherently sparse, with positive interaction rates remaining below 10%. By contrast, all evaluated LLM-based simulators exhibit a hyper-activity bias.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 11 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Anthropic. Introducing Claude 4. https://www.anthropic.com/news/claude-4, May 2025.
[3] Anthropic. Introducing Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, October 2025.
[4] Anthropic. Introducing Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5, November 2025.
[5] Anthropic. Introducing Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5, September 2025.
[6] W Brian Arthur. Designing economic agents that act like human agents: A behavioral approach to bounded rationality. The American Economic Review, 81(2):353–359, 1991.
[7] James Bennett and Stan Lanning. The Netflix prize. 2007.
[8] Lucas Bernardi, Sakshi Batra, and Cintia Alicia Bruscantini. Simulations in recommender systems: An industry perspective. arXiv preprint arXiv:2109.06723, 2021.
[9] Nicolas Bougie and Narimawa Watanabe. Simuser: Simulating user behavior with large language models for recommender system evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 43–60, 2025.
[10] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
[11] Jiawei Chen, Xinyan Guan, Qianhao Yuan, Guozhao Mo, Weixiang Zhou, Yaojie Lu, Hongyu Lin, Ben He, Le Sun, and Xianpei Han. Consistentchat: Building skeleton-guided consistent multi-turn dialogues for large language models from scratch. In The 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
[12] Geoffrey PE Clarkson and Herbert A Simon. Simulation of individual and group behavior. The American Economic Review, pages 920–932, 1960.
[13] Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. A computational approach to politeness with application to social factors. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 250–259, 2013.
[14] Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1):1–24, 2024.
[15] Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. Kuairec: A fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 540–550, 2022.
[16] Google. Gemini 3 Flash: frontier intelligence built for speed. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, December 2025.
[17] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.
[18] Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, and Paul Röttger. SimBench: Benchmarking the ability of large language models to simulate human behaviors. arXiv preprint arXiv:2510.17516, 2025.
[19] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[20] Eugene Ie, Chih-wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. Recsim: A configurable simulation platform for recommender systems. arXiv preprint arXiv:1909.04847, 2019.
[21] Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, and Jiangtao Gong. Surrealdriver: Designing generative driver agent simulation framework in urban contexts based on large language model. arXiv preprint arXiv:2309.13193, 5(7):8, 2023.
[22] Andreas Konstantin Kruff, Christin Katharina Kreutz, Timo Breuer, Philipp Schaer, and Krisztian Balog. Sim4ia-bench: A user simulation benchmark suite for next query and utterance prediction. arXiv preprint arXiv:2511.09329, 2025.
[23] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
[24] R Greer Lavery. Artificial intelligence and simulation: An introduction. In Proceedings of the 18th Conference on Winter Simulation, pages 448–452, 1986.
[25] Kurt Lewin. Field theory in social science: selected theoretical papers (edited by Dorwin Cartwright). 1951.
[26] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[27] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
[28] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753, 2024.
[29]

Langchain

Vasilios Mavroudis. Langchain. 2024

[30] James L McClelland. The place of modeling in cognitive science. Topics in Cognitive Science, 1(1):11–38, 2009.

[31] James B McQueen. Some methods of classification and analysis of multivariate observations. In Proc. of 5th Berkeley Symposium on Math. Stat. and Prob., pages 281–297, 1967.

[32] X Mou, Z Wei, and X Huang. Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation. arXiv preprint arXiv:2402.16333, 2024.

[33] Mohd Naveed Uddin. Cognitive science and artificial intelligence: simulating the human mind and its complexity. Cognitive Computation and Systems, 1(4):113–116, 2019.

[34] OpenAI. Gpt-5.2. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-2/, 2025.

[35] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023.

[36] Vladimir M Petrović. Artificial intelligence and virtual worlds–toward human-level ai agents. IEEE Access, 6:39976–39988, 2018.

[37] Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 2685–2692, 2020.

[38] Priyanshu Priya, Mauajama Firdaus, and Asif Ekbal. Computational politeness in natural language processing: A survey. ACM Computing Surveys, 56(9):1–42, 2024.

[39] Changle Qu, Sunhao Dai, Ke Guo, Liqin Zhao, Yanan Niu, Xiao Zhang, and Jun Xu. Kuailive: A real-time interactive dataset for live streaming recommendation. arXiv preprint arXiv:2508.05633, 2025.

[40] Julian Reiss. A plea for (good) simulations: nudging economics toward an experimental science. Simulation & gaming, 42(2):243–264, 2011.

[41] Ruiyang Ren, Peng Qiu, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Bases: Large-scale web search user simulation with large language model based agents. arXiv preprint arXiv:2402.17505, 2024.

[42] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023.

[43] Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4902–4909, 2019.

[44] Patrick Taillandier, Jean Daniel Zucker, Arnaud Grignard, Benoit Gaudou, Nghi Quang Huynh, and Alexis Drogoul. Integrating llm in agent-based social simulation: Opportunities and challenges. arXiv preprint arXiv:2507.19364, 2025.

[45] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.

[46] Lei Wang, Jingsen Zhang, Hao Yang, Zhiyuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, et al. When large language model based agent meets user behavior analysis: A novel user simulation paradigm. arXiv preprint arXiv:2306.02552, 2023.

[47] Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haoxuan Li, Xu Chen, Xing Xie, and Ji-Rong Wen. Characterbox: Evaluating the role-playing capabilities of llms in text-based virtual worlds. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long P...

[48] Noah Wang, Zy Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, et al. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14743–14777, 2024.

[49] Zixu Wang, Bin Xie, Bingbing Xu, Shengmao Zhu, Yige Yuan, Liang Pang, Long Yang Du Su, Zixuan Li, Huawei Shen, and Xueqi Cheng. A survey on llm-based agents for social simulation: Taxonomy, evaluation and applications.

[50] Qiuejie Xie, Qiming Feng, Tianqi Zhang, Qingqiu Li, Linyi Yang, Yuejie Zhang, Rui Feng, Liang He, Shang Gao, and Yue Zhang. Human simulacra: Benchmarking the personification of large language models. arXiv preprint arXiv:2402.18180, 2024.

[51] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[52] An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2.5-1m technical report. arXiv preprint arXiv:2501.15383, 2025.

[53] Se-eun Yoon, Zhankui He, Jessica Echterhoff, and Julian McAuley. Evaluating large language models as generative user simulators for conversational recommendation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1490–1504, 2024.

[54] Guanghu Yuan, Fajie Yuan, Yudong Li, Beibei Kong, Shujie Li, Lei Chen, Min Yang, Chenyun Yu, Bo Hu, Zang Li, et al. Tenrec: A large-scale multipurpose benchmark dataset for recommender systems. Advances in Neural Information Processing Systems, 35:11480–11493, 2022.

[55] Z.ai. Glm-4.7: Advancing the coding capability. https://z.ai/blog/glm-4.7, December 2025.

[56] Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. Agentcf: Collaborative learning with autonomous language agents for recommender systems. In Proceedings of the ACM Web Conference 2024, pages 3679–3689, 2024.

[57] Qingyu Zhang, Chunlei Xin, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Qing Ye, Qianlong Xie, and Xingxing Wang. Ai-salesman: Towards reliable large language model driven telemarketing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34790–34798, 2026.

[58] Shuo Zhang and Krisztian Balog. Evaluating conversational recommender systems via user simulation. In Proceedings of the 26th acm sigkdd international conference on knowledge discovery & data mining, pages 1512–1520, 2020.

[59] Nan Zhao, Haoran Li, Youzheng Wu, Xiaodong He, and Bowen Zhou. The jddc 2.0 corpus: A large-scale multimodal multi-turn chinese dialogue dataset for e-commerce customer service. arXiv preprint arXiv:2109.12913, 2021.

[60] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1059–1068, 2018.

A Data Statistics

A.1 Action Sequence Length Distribution

F...

1. Live streaming type: What type of live streaming is this (E-commerce / gaming / chatting / talent performance, etc.)
2. Host characteristics: The host’s basic appearance features
3. Image text: Extract key text from the cover (**Note: Only extract core text such as live streaming title, product names, prices, promotional information, etc. Do NOT extract b...

One concise category (Category), such as “Beauty,” “Games,” “News,” etc.

Three specific keywords (Keywords). Ignore the interactive form of the text. Even if it is casual chat between friends, look beyond the social surface and identify the underlying topic being discussed.

Content: "text"

Output Format: Return only a JSON object containing two fields: "category" and "keywords" (a list of strings). Example: "category": "Techno...

Noise Removal:
* Remove meaningless garbled characters (e.g., AC:BU526, IC·BQ528, within 50 meters, and other interfering information).
* Filter excessively redundant filler words, such as repeated occurrences of “uh,” “ah,” “that is to say,” retaining only those necessary for context

3. Semantic Correction:
* Correct obvious recognition errors (e.g., change “cumin cowhide” to “naturally revealed,” or infer based on context; if the correct meaning cannot be determined, keep the original).
* Complete broken sentences and add commas, periods, or question marks appropriately based on tone and emphasis.

4. Formatting Standards:
* Unify full-w...

