Recognition: 2 theorem links
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3
The pith
Large language models simulating human behavior converge to a positive average person and erase individual differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the OmniBehavior benchmark, which is built from real-world long-horizon, cross-scenario, and heterogeneous behavioral traces, state-of-the-art LLMs struggle to simulate behavior accurately even as context length grows. Direct comparison to authentic data reveals a structural bias: models converge toward a positive average person, producing hyper-activity, persona homogenization, and a Utopian bias that together erase individual differences and long-tail behaviors.
What carries the argument
The OmniBehavior benchmark, which unifies long-horizon, cross-scenario, and heterogeneous real-world behavioral traces into a single evaluation framework for comparing simulated outputs against authentic decision sequences.
If this is right
- Prior isolated-scenario benchmarks produce tunnel vision that does not reflect how real decisions form across linked scenarios over time.
- LLM simulation performance plateaus rather than improving once context windows are enlarged.
- High-fidelity user simulation will require targeted methods to restore individual differences and long-tail behaviors.
- Applications relying on behavioral simulation inherit the same convergence to averaged positive patterns.
Where Pith is reading between the lines
- The bias may stem from training data that over-represents normative or desirable outcomes, suggesting targeted data augmentation with diverse real traces as one remedy.
- Downstream tasks such as agent-based social modeling or predictive user interfaces would systematically under-represent risk-taking or atypical choices.
- Repeating the comparison on non-LLM simulators or on future model families could isolate whether the convergence is specific to current autoregressive architectures.
Load-bearing premise
The collected real-world behavioral traces represent holistic human decision-making without meaningful collection or annotation biases, and the metrics for activity, homogenization, and positivity accurately reflect model limitations rather than benchmark artifacts.
What would settle it
Re-running the same LLM simulations on the benchmark traces and finding that activity levels, persona variance, and outcome positivity distributions match the real data within measurement error.
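The settling test above amounts to a distribution comparison: do simulated activity levels, persona variance, and positivity match the real data within sampling error? A minimal sketch of one plausible check, using synthetic data and a two-sample Kolmogorov-Smirnov statistic (the paper does not specify which test it would use; all data below is invented):

```python
# Hypothetical sketch: checking whether a simulated behavior distribution
# matches the real one within sampling error, via a two-sample
# Kolmogorov-Smirnov statistic. Synthetic data only, not from the paper.
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:
        fa = sum(x <= v for x in a) / len(a)
        fb = sum(x <= v for x in b) / len(b)
        d = max(d, abs(fa - fb))
    return d

random.seed(0)
n = 500
real = [random.gauss(0.0, 1.0) for _ in range(n)]          # stand-in for a real metric, e.g. activity level
sim_matched = [random.gauss(0.0, 1.0) for _ in range(n)]   # a simulator with no systematic shift
sim_positive = [random.gauss(0.8, 1.0) for _ in range(n)]  # a "Utopian" simulator shifted toward positivity

# Approximate 5% critical value for two equal samples of size n
crit = 1.36 * math.sqrt(2 / n)

d_matched = ks_statistic(real, sim_matched)
d_positive = ks_statistic(real, sim_positive)
```

On this toy data the shifted simulator's statistic exceeds the critical value while the matched one's typically does not, which is exactly the "within measurement error" criterion the pith describes.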
Original abstract
The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniBehavior, the first benchmark for LLM-based user simulation constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral traces. It argues that prior benchmarks suffer from tunnel vision due to isolated scenarios, provides empirical evidence that LLMs struggle to simulate authentic complex behaviors (with performance plateauing despite larger context windows), and identifies a structural bias wherein LLMs converge toward a 'positive average person' via hyper-activity, persona homogenization, and Utopian bias, resulting in loss of individual differences and long-tail behaviors.
Significance. If the central claims hold after addressing data and metric robustness, the work would be significant for establishing a more realistic evaluation framework for human behavior simulation and for documenting concrete limitations in current LLMs' ability to capture behavioral heterogeneity. This could usefully direct research toward better fidelity in agent and user modeling. The contribution is tempered by the absence of detailed validation for the real traces and bias metrics, which directly affects how much weight the structural-bias conclusion can carry.
Major comments (2)
- [§3] §3 (Benchmark Construction): The manuscript provides no details on data collection protocols, sample sizes, participant selection criteria, annotation procedures, or statistical methods used to build the real-world traces. This is load-bearing for the central claim, as the comparison of LLM outputs to 'authentic' behaviors and the identification of structural biases presuppose that the collected traces are an unbiased, representative sample of heterogeneous long-horizon decision-making.
- [§5] §5 (Bias Analysis): The quantitative definitions and measurement procedures for hyper-activity, persona homogenization, and Utopian bias are not formalized (no equations or explicit aggregation rules are given). The reported convergence to a 'positive average person' appears to use global means without per-person baselines or robustness checks against alternative labeling/aggregation choices; this leaves open the possibility that the observed gaps are benchmark-construction artifacts rather than intrinsic LLM properties, directly undermining the 'fundamental structural bias' conclusion.
Minor comments (2)
- [Abstract] The abstract states that 'performance plateauing' occurs as context windows expand but does not report the specific context lengths tested, the exact performance metrics (e.g., action prediction accuracy, sequence similarity), or the statistical significance of the plateau.
- [Figures/Tables] Figure and table captions could more explicitly link visual results to the three named biases (hyper-activity, homogenization, Utopian bias) to improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments identify areas where additional clarity and formalization will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns about data provenance and metric definitions.
Point-by-point responses
Referee: [§3] §3 (Benchmark Construction): The manuscript provides no details on data collection protocols, sample sizes, participant selection criteria, annotation procedures, or statistical methods used to build the real-world traces. This is load-bearing for the central claim, as the comparison of LLM outputs to 'authentic' behaviors and the identification of structural biases presuppose that the collected traces are an unbiased, representative sample of heterogeneous long-horizon decision-making.
Authors: We agree that Section 3 would benefit from expanded documentation of the data pipeline. In the revised manuscript we will add a dedicated subsection titled 'Trace Collection and Validation' that specifies: (i) participant recruitment channels and inclusion/exclusion criteria, (ii) exact sample sizes (number of users, total trace hours, and cross-scenario coverage), (iii) logging protocols and consent procedures, (iv) any post-collection annotation or scenario labeling steps, and (v) statistical checks performed to assess representativeness and heterogeneity. These additions will allow readers to evaluate the degree to which the traces support the authenticity claims. revision: yes
Referee: [§5] §5 (Bias Analysis): The quantitative definitions and measurement procedures for hyper-activity, persona homogenization, and Utopian bias are not formalized (no equations or explicit aggregation rules are given). The reported convergence to a 'positive average person' appears to use global means without per-person baselines or robustness checks against alternative labeling/aggregation choices; this leaves open the possibility that the observed gaps are benchmark-construction artifacts rather than intrinsic LLM properties, directly undermining the 'fundamental structural bias' conclusion.
Authors: We accept that the current presentation of the bias metrics lacks sufficient formalization. The revision will introduce explicit equations and aggregation rules for each bias: hyper-activity will be defined as the per-user deviation in action rate relative to the corresponding real trace; persona homogenization will be quantified via the reduction in behavioral embedding variance across simulated versus real individuals; and Utopian bias will be measured by a positivity score derived from outcome sentiment. In addition, we will report per-person baseline comparisons (rather than solely global means) and include robustness analyses that vary labeling granularity and aggregation functions (mean vs. median, different embedding models). These changes will directly test whether the observed convergence is robust or an artifact of the chosen metrics. revision: partial
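The metric formalizations promised in this response can be sketched in a few lines. This is an illustrative reading of the rebuttal's wording, not the authors' implementation; all function names and numbers are invented for the example:

```python
# Illustrative sketch (not the authors' code) of the three bias metrics as
# paraphrased in the rebuttal: hyper-activity as per-user action-rate
# deviation, homogenization as cross-person variance reduction, and Utopian
# bias as a shift in mean outcome positivity. All numbers are made up.
import statistics

def hyper_activity(sim_rates, real_rates):
    """Mean per-user deviation of simulated action rate from the real trace."""
    return statistics.mean(s - r for s, r in zip(sim_rates, real_rates))

def homogenization(sim_scores, real_scores):
    """1 - var(sim)/var(real) over per-person behavior scores; positive
    values mean simulated personas vary less than real people do."""
    return 1 - statistics.pvariance(sim_scores) / statistics.pvariance(real_scores)

def utopian_bias(sim_sentiment, real_sentiment):
    """Shift in mean outcome positivity (sentiment scored in [-1, 1])."""
    return statistics.mean(sim_sentiment) - statistics.mean(real_sentiment)

# Toy example: four simulated users compared against their real traces
ha = hyper_activity([4.0, 6.0, 3.0, 7.0], [2.0, 5.0, 1.0, 8.0])     # simulator over-acts on average
hom = homogenization([0.45, 0.55, 0.5, 0.5], [0.1, 0.9, 0.3, 0.7])  # simulated personas collapse together
ub = utopian_bias([0.6, 0.7, 0.5, 0.8], [0.1, -0.2, 0.3, 0.0])      # simulated outcomes skew positive
```

Note that `hyper_activity` is computed per user before averaging, which is the per-person-baseline point the referee raises: a global mean over pooled actions could hide offsetting individual deviations.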
Circularity Check
No circularity: claims rest on direct empirical comparison to independent real-world traces
Full rationale
The paper constructs OmniBehavior from real-world data and evaluates LLMs via direct comparison of simulated outputs against held-out authentic behavioral traces. The reported structural biases (hyper-activity, persona homogenization, Utopian bias) are presented as outcomes of this external comparison rather than quantities derived by fitting parameters, self-defining metrics, or reducing via self-citation chains within the study. No equations, ansatzes, or uniqueness theorems are invoked that collapse the central claims back to the paper's own inputs by construction. The derivation chain remains self-contained against the benchmark data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Real-world behavioral traces can be assembled into a unified benchmark that faithfully captures long-horizon, cross-scenario heterogeneity without significant selection or annotation artifacts.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "current LLMs exhibit a substantial capability gap in modeling real-world user behaviors, regardless of context length"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.