Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Pith reviewed 2026-05-22 10:26 UTC · model grok-4.3
The pith
LLMs simulating real human behavior converge toward a positive average person and erase individual differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors.
What carries the argument
The OmniBehavior benchmark, which assembles long-horizon, cross-scenario, and heterogeneous behavioral patterns directly from real-world data to serve as ground truth for simulation fidelity.
If this is right
- Isolated-scenario datasets create tunnel vision that hides the cross-scenario causal chains present in real decision-making.
- LLM simulation performance plateaus even when context windows are enlarged.
- The structural bias produces outputs that systematically omit the low-frequency behaviors observed in authentic traces.
- High-fidelity simulation will require explicit mechanisms to preserve individual differences rather than defaulting to an averaged persona.
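The third consequence above, the omission of low-frequency behaviors, can be checked with a simple coverage count. The following is a minimal sketch and not the paper's method: the function names, the `tail_threshold` parameter, and the toy traces are all assumptions made for illustration. It treats each trace as a list of action labels and reports what share of the rare real-world actions a simulator produces at least once.

```python
from collections import Counter


def action_frequencies(traces):
    """Relative frequency of each action label across all traces."""
    counts = Counter(action for trace in traces for action in trace)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {action: n / total for action, n in counts.items()}


def long_tail_coverage(real_traces, sim_traces, tail_threshold=0.05):
    """Share of rare real actions that appear at least once in simulation.

    An action counts as "rare" when its real-world relative frequency is
    below tail_threshold (a hypothetical cutoff, not taken from the paper).
    """
    real_freq = action_frequencies(real_traces)
    tail = {a for a, f in real_freq.items() if f < tail_threshold}
    if not tail:
        return 1.0  # nothing rare to cover
    sim_actions = {action for trace in sim_traces for action in trace}
    return len(tail & sim_actions) / len(tail)


# Toy data, invented for illustration only.
real = [["view"] * 9 + ["like"], ["skip"] * 9 + ["report"]]
sim = [["view", "like", "view"], ["view", "like", "like"]]

print(long_tail_coverage(real, sim, tail_threshold=0.1))
```

A coverage value well below 1.0 would be one symptom of the long-tail loss described above. A presence check is deliberately coarse, so comparing the frequencies themselves would be a natural next step.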
Where Pith is reading between the lines
- The same convergence may appear in other generative tasks that rely on modeling user preferences or sequences, such as personalized recommendation or dialogue systems.
- A practical extension would be to add explicit regularization or retrieval steps that force models to reproduce measured frequencies of rare actions from the source traces.
- Testing whether the bias persists when models are given explicit negative or low-activity examples from the same data would clarify whether the issue is data scarcity or architectural.
- If the homogenization is confirmed across multiple languages or cultures, it would indicate a training-data skew rather than a language-specific artifact.
Load-bearing premise
The real-world behavioral traces collected for OmniBehavior accurately and representatively capture authentic long-horizon, cross-scenario human decision-making without significant selection or measurement biases.
What would settle it
Collect a fresh set of long-horizon traces from a demographically different population that explicitly includes documented long-tail decisions, then measure whether LLM outputs still flatten those decisions into positive averages.
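One way to make "flatten" measurable is to compare per-user positive-interaction rates in real and simulated traces: a higher simulated mean would point to hyper-activity, and a smaller simulated spread would point to homogenization. The sketch below assumes each user's trace is encoded as a list of 0/1 flags (1 for a positive interaction). That encoding, the function names, and the toy numbers are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def positive_rate(trace):
    """Share of events in one user's trace that are positive interactions."""
    return sum(trace) / len(trace)


def rate_summary(traces_by_user):
    """Mean and population standard deviation of per-user positive rates."""
    rates = [positive_rate(t) for t in traces_by_user.values() if t]
    return {"mean_rate": mean(rates), "dispersion": pstdev(rates)}


# Toy data, invented for illustration only.
real_users = {
    "u1": [0] * 10,
    "u2": [1] + [0] * 9,
    "u3": [1, 1] + [0] * 8,
}
sim_users = {uid: [1] * 5 + [0] * 5 for uid in real_users}

real_stats = rate_summary(real_users)
sim_stats = rate_summary(sim_users)

# Hyper-activity: the simulated mean rate exceeds the real one.
print(sim_stats["mean_rate"] > real_stats["mean_rate"])
# Homogenization: simulated users differ less from each other than real users.
print(sim_stats["dispersion"] < real_stats["dispersion"])
```

Running the same summary on the fresh population proposed above would show whether both gaps persist.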
Original abstract
The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniBehavior, a benchmark for LLM-based user simulation constructed entirely from real-world long-horizon, cross-scenario, and heterogeneous behavioral traces. It argues that existing isolated-scenario datasets suffer from tunnel vision compared to real-world causal chains, shows that state-of-the-art LLMs struggle to simulate these behaviors with performance plateauing despite larger context windows, and identifies a structural bias in LLMs toward a positive average person, manifested as hyper-activity, persona homogenization, and utopian bias that erases individual differences and long-tail behaviors.
Significance. If the empirical comparisons and bias findings hold after rigorous validation of the ground-truth data, this work would be significant for advancing user simulation research in NLP and HCI. It provides the first unified real-world benchmark beyond synthetic or narrow scenarios and surfaces concrete failure modes (homogenization, loss of long-tail events) that could guide mitigation strategies in generative behavior modeling.
major comments (2)
- [Dataset construction / OmniBehavior description] The manuscript states that OmniBehavior is 'constructed entirely from real-world data' and that systematic differences reveal LLM structural bias, but supplies no details on recruitment, logging completeness, demographic coverage, or handling of missing long-tail events. This is load-bearing for the central claim because the reported convergence to a positive average person and loss of individual differences could arise from selection or measurement biases in the reference traces rather than an intrinsic LLM property.
- [Evaluation and results] The abstract reports 'extensive evaluations' of LLMs, performance plateauing with context expansion, and a 'fundamental structural bias,' yet provides no metrics (e.g., behavioral divergence, accuracy on action sequences), statistical tests, data scale (number of users/traces), or controls. Without these, the evidence for both the simulation failures and the specific bias patterns (hyper-activity, homogenization, utopian bias) cannot be assessed for robustness.
minor comments (1)
- [Abstract / bias analysis] Clarify the precise definition of 'utopian bias' and 'positive average person' with concrete examples from the traces to avoid ambiguity in interpretation.
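The second major comment asks for a behavioral divergence metric without naming one. As a hedged illustration, the Jensen-Shannon divergence between the real and simulated action distributions is one common choice; nothing on this page indicates that the paper uses it. The helper names and the example action labels below are assumptions.

```python
import math
from collections import Counter


def distribution(actions, support):
    """Relative frequencies of `actions` over a fixed, ordered `support`."""
    if not actions:
        raise ValueError("actions must be non-empty")
    counts = Counter(actions)
    total = sum(counts[a] for a in support)
    return [counts[a] / total for a in support]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy action logs, invented for illustration only.
support = ["view", "skip", "like", "report"]
real_p = distribution(["view", "skip", "skip", "report"], support)
sim_q = distribution(["view", "like", "like", "like"], support)

print(js_divergence(real_p, sim_q))
```

Because the measure is symmetric and bounded, values from different models can be compared directly, which suits the kind of robustness check the referee requests.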
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments in detail below and have prepared revisions to improve the clarity and completeness of the paper.
- Referee: [Dataset construction / OmniBehavior description] The manuscript states that OmniBehavior is 'constructed entirely from real-world data' and that systematic differences reveal LLM structural bias, but supplies no details on recruitment, logging completeness, demographic coverage, or handling of missing long-tail events. This is load-bearing for the central claim because the reported convergence to a positive average person and loss of individual differences could arise from selection or measurement biases in the reference traces rather than an intrinsic LLM property.
  Authors: We agree that providing more details on dataset construction is important for validating our claims. In the revised version of the manuscript, we will expand the relevant section to include information on recruitment procedures, logging completeness, demographic coverage summaries, and methods for handling missing long-tail events. This will help demonstrate that the observed biases are not artifacts of the data collection process. We note that ethical and privacy considerations limit the extent of detail we can provide on individual participants.
  Revision: yes
- Referee: [Evaluation and results] The abstract reports 'extensive evaluations' of LLMs, performance plateauing with context expansion, and a 'fundamental structural bias,' yet provides no metrics (e.g., behavioral divergence, accuracy on action sequences), statistical tests, data scale (number of users/traces), or controls. Without these, the evidence for both the simulation failures and the specific bias patterns (hyper-activity, homogenization, utopian bias) cannot be assessed for robustness.
  Authors: The full paper contains these metrics and details in the experiments and results sections. To address the concern about accessibility, we will update the abstract to briefly mention key quantitative findings and include a summary of the evaluation metrics, statistical tests, data scale, and controls in the main text or a new table. This revision will make the evidence more readily assessable while preserving the paper's structure.
  Revision: yes
Circularity Check
No circularity: empirical benchmark rests on external real-world traces
The paper introduces OmniBehavior as a benchmark constructed from real-world data and evaluates LLMs via direct comparisons of simulated versus authentic long-horizon behaviors. No equations, derivations, fitted parameters, or self-citations appear in the provided text as load-bearing elements for the central claims. The reported structural biases (hyper-activity, homogenization, utopian bias) are presented as outcomes of systematic differences against the external reference traces rather than as quantities that reduce to an input by construction. This is a standard empirical setup that is self-contained against external benchmarks, which is consistent with the default non-circular finding for such papers.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Real-world data can be integrated into a unified framework capturing long-horizon, cross-scenario, and heterogeneous behavioral patterns without tunnel vision.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel; Jcost_pos_of_ne_one (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors."
- IndisputableMonolith/Cost.lean: Jcost_unit0; Jcost_pos_of_ne_one (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "Real human behavior is inherently sparse, with positive interaction rates remaining below 10%. By contrast, all evaluated LLM-based simulators exhibit a hyper-activity bias."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.