SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Pith reviewed 2026-05-18 06:25 UTC · model grok-4.3
The pith
Large language models reach only modest fidelity when simulating human behaviors on a new unified benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By unifying twenty datasets that together cover a broad range of human decision tasks, SimBench supports four findings. The best current large language models reach a simulation fidelity score of 40.80 out of 100. That score scales log-linearly with model size but does not improve with added inference-time compute. Instruction tuning creates an alignment-simulation tradeoff that favors consensus questions over diverse ones. Finally, simulation skill correlates most strongly with performance on knowledge-intensive reasoning benchmarks such as MMLU-Pro.
What carries the argument
SimBench, the benchmark formed by standardizing and unifying twenty diverse human-behavior datasets into a single, comparable fidelity score.
If this is right
- Larger models will continue to improve at human simulation in a predictable way.
- Giving models more compute at inference time will not raise their simulation scores.
- Models will remain weaker at reproducing answers from particular demographic groups than from the overall population.
- Progress on general reasoning benchmarks will continue to predict gains in simulation ability.
Where Pith is reading between the lines
- Researchers could use SimBench scores as a training signal to fine-tune models specifically for higher simulation fidelity.
- The demographic gaps suggest that current alignment methods may limit how well models can represent varied human populations in social-science applications.
- If the tradeoff holds, developers might need separate model versions for high-consensus versus high-diversity simulation tasks.
Load-bearing premise
The twenty chosen datasets, once unified, give a valid and representative picture of the full range of human behavioral variation.
What would settle it
Running the same models on a large new collection of human responses, drawn from a population study outside the original twenty datasets, and obtaining scores far above or below the reported 40.80 would show that the headline score depends on which datasets were chosen.
Original abstract
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations of simulation fidelity are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that the best LLMs today achieve meaningful but modest simulation fidelity (score: 40.80/100), with performance scaling log-linearly with model size but not with increased inference-time compute. We discover an alignment-simulation tradeoff: instruction tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with knowledge-intensive reasoning (MMLU-Pro, r = 0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SimBench, a standardized benchmark created by unifying 20 diverse human-behavior datasets spanning moral decision-making, economic choice, and other tasks drawn from a large global participant pool. The central claims are that current best LLMs reach a modest simulation fidelity of 40.80/100, that fidelity scales log-linearly with model size but not with increased inference-time compute, that instruction tuning produces an alignment-simulation tradeoff (improving low-entropy/consensus items while degrading high-entropy/diverse ones), that models struggle to simulate specific demographic groups, and that simulation ability correlates strongly with knowledge-intensive reasoning (MMLU-Pro, r = 0.939).
Significance. If the unified benchmark constitutes a representative ground truth, the work supplies the first large-scale, reproducible instrument for measuring LLM simulation fidelity. The reported scaling relationship, alignment tradeoff, demographic gaps, and correlation with MMLU-Pro would then constitute actionable, falsifiable findings that could guide both model development and the responsible use of LLMs as proxies in the social and behavioral sciences.
major comments (1)
- [Abstract] Abstract (paragraph on dataset unification): the claim that the 20 chosen datasets supply a valid and representative ground truth for human behavioral variation is load-bearing for the headline fidelity score (40.80/100), the log-linear scaling result, and the demographic findings, yet the manuscript provides no explicit inclusion criteria, no description of response-format harmonization (e.g., probability binning, Likert rescaling, or handling of varying participant N), and no test for coverage of high- versus low-entropy behaviors. Without these details the reported metrics risk being artifacts of benchmark construction rather than intrinsic properties of the models.
minor comments (2)
- [Abstract] The abstract states that performance 'scales log-linearly with model size' but does not indicate which model sizes were included or report the regression statistics; adding these details would strengthen the scaling claim.
- [Abstract] Consider reporting the total number of unique participants and a brief diversity metric for the 'large global participant pool' to allow readers to assess coverage.
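The regression statistics requested in the first minor comment could be reported with a fit like the sketch below. It is a minimal illustration, not the authors' analysis: it fits score = intercept + slope * log10(parameter count) by ordinary least squares and returns R². The function name `fit_log_linear` is hypothetical, and the model sizes and scores are made-up placeholders, not values from the paper.

```python
import math

def fit_log_linear(param_counts, scores):
    """Ordinary least squares fit of score = intercept + slope * log10(param_count)."""
    xs = [math.log10(n) for n in param_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2: share of score variance explained by log model size
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, scores))
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Placeholder inputs for illustration only (NOT results from the paper)
placeholder_sizes = [1e9, 1e10, 1e11]
placeholder_scores = [10.0, 20.0, 30.0]
slope, intercept, r_squared = fit_log_linear(placeholder_sizes, placeholder_scores)
```

Here `slope` reads as score points gained per tenfold increase in parameters, which is the quantity a log-linear scaling claim implies.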
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We agree that greater transparency regarding dataset selection and harmonization is essential to substantiate the representativeness of SimBench. We address the concern point by point below and will incorporate the requested clarifications in the revised manuscript.
Point-by-point responses
-
Referee: [Abstract] Abstract (paragraph on dataset unification): the claim that the 20 chosen datasets supply a valid and representative ground truth for human behavioral variation is load-bearing for the headline fidelity score (40.80/100), the log-linear scaling result, and the demographic findings, yet the manuscript provides no explicit inclusion criteria, no description of response-format harmonization (e.g., probability binning, Likert rescaling, or handling of varying participant N), and no test for coverage of high- versus low-entropy behaviors. Without these details the reported metrics risk being artifacts of benchmark construction rather than intrinsic properties of the models.
Authors: We thank the referee for this important observation. The full manuscript contains a 'Data Sources' section describing the 20 datasets, but we acknowledge that explicit inclusion criteria, harmonization procedures, and entropy coverage were not stated with sufficient precision. In the revised version we will add a dedicated subsection 'Benchmark Construction and Validation' that specifies: (1) inclusion criteria (public availability, minimum 500 participants, coverage of at least five distinct behavioral domains, and global participant diversity); (2) harmonization details (conversion of all responses to a common [0,1] probability scale, Likert rescaling via min-max normalization, and participant-weighted averaging to accommodate varying N); and (3) an entropy analysis, including a new supplementary figure that reports the distribution of item-level entropy across the benchmark and confirms balanced coverage of high-entropy (diverse) and low-entropy (consensus) behaviors. These additions will make the ground-truth claim more robust without altering the reported fidelity scores or scaling relationships.
Revision: yes
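The harmonization steps named in the rebuttal (min-max Likert rescaling, participant-weighted averaging, item-level entropy) can be sketched as follows. This is an illustrative reading of those steps, not the authors' code: all three function names are hypothetical, and dividing the entropy by the log of the number of options is an assumption made here so that items with different option counts are comparable.

```python
import math

def minmax_rescale(response, scale_min, scale_max):
    """Map a Likert response onto [0, 1] by min-max normalization."""
    return (response - scale_min) / (scale_max - scale_min)

def participant_weighted_average(distributions, participant_counts):
    """Pool per-group option distributions, weighting each group by its N."""
    total = sum(participant_counts)
    n_options = len(distributions[0])
    return [
        sum(dist[k] * n for dist, n in zip(distributions, participant_counts)) / total
        for k in range(n_options)
    ]

def normalized_entropy(dist):
    """Shannon entropy divided by log(#options): 0 is full consensus, 1 is uniform.

    The normalization is an assumption of this sketch, not taken from the paper.
    """
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / math.log(len(dist))

# Two hypothetical groups answering the same two-option item
pooled = participant_weighted_average([[0.8, 0.2], [0.4, 0.6]], [100, 300])
item_entropy = normalized_entropy(pooled)
```

A histogram of `normalized_entropy` over all items would be one way to produce the coverage figure the authors promise.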
Circularity Check
Empirical benchmark construction against external datasets exhibits no circularity
The paper introduces SimBench as a unification of 20 independent human behavior datasets drawn from external sources and reports direct empirical measurements of LLM simulation fidelity (e.g., 40.80/100 score, log-linear scaling with model size, alignment-simulation tradeoff, and MMLU-Pro correlation). These results are computed by comparing model outputs to the held-out human data distributions rather than being derived from any fitted parameters, self-referential equations, or ansatzes internal to the paper. No derivation chain reduces to its own inputs by construction, and the central claims rest on observable performance against independent ground truth rather than on load-bearing self-citation or the renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 20 datasets together cover a representative sample of human behavioral variation across moral, economic, and other domains.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We unify 20 diverse datasets... standardize all data into group-level probability distributions... S(P, Q) = 100 (1 - TVD(P,Q)/TVD(P,U))
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones
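The score quoted in the first passage above, S(P, Q) = 100 (1 - TVD(P,Q)/TVD(P,U)), can be computed as in the sketch below. The passage does not define its symbols, so this sketch assumes P is the human response distribution, Q is the model's predicted distribution, and U is the uniform distribution over the same options. The function names are hypothetical.

```python
def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def simbench_score(human, predicted):
    """S(P, Q) = 100 * (1 - TVD(P, Q) / TVD(P, U)).

    Assumes P = human, Q = predicted, U = uniform over the same options.
    """
    uniform = [1.0 / len(human)] * len(human)
    baseline = total_variation_distance(human, uniform)
    if baseline == 0:
        # A uniform human distribution leaves the ratio undefined
        raise ValueError("human distribution is uniform; score is undefined")
    return 100.0 * (1.0 - total_variation_distance(human, predicted) / baseline)

human = [0.7, 0.2, 0.1]                         # hypothetical human answer shares
perfect = simbench_score(human, human)          # prediction equals human data
guess = simbench_score(human, [1 / 3] * 3)      # uniform guess
worse = simbench_score(human, [0.1, 0.2, 0.7])  # farther from humans than uniform
```

Under these assumptions a perfect prediction scores 100, a uniform guess scores 0, and a prediction farther from the human data than uniform scores below 0, so the score is a relative improvement over guessing rather than a percentage of correct answers.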
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 7 Pith papers
-
Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench
ConsumerSimBench evaluates 13 LLMs on reconstructing crowd reactions from 1,553 Chinese social-media topics using 23,122 auditable yes-no criteria, finding maximum coverage of 47.8% by Gemini-3.1-Pro.
-
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Introduces OmniBehavior benchmark from real-world data and shows LLMs exhibit hyper-activity, persona homogenization, and utopian bias in behavior simulation.
-
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.
-
PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior
PrivacySIM shows that conditioning LLMs on user personas like demographics and attitudes improves simulation of privacy choices but reaches only 40.4% accuracy against real responses from 1,000 users.
-
LLM-Based Educational Simulation: Evaluating Temporal Student Persona Stability Across ADHD Profiles
LLM-simulated ADHD student personas show stable self-reported traits but behavioral drift in unscripted interactions that explicit task prompts fully eliminate.
-
LLM-Based Educational Simulation: Evaluating Temporal Student Persona Stability Across ADHD Profiles
LLM student personas with ADHD show stable self-reported traits at high intensity but behavioral drift in unscripted interactions that scripted prompts eliminate.
-
The $\textit{Silicon Society}$ Cookbook: Design Space of LLM-based Social Simulations
The base LLM choice dominates simulation outcomes in LLM-based social networks, while other design parameters show either additive or complex interactive effects.
Reference graph
Works this paper leans on
-
[6]
Much of our source data is not easily ingested by standard web scrapes. Virtually all of our source data are primarily distributed as structured data files in specialized formats like R or SAS in academic archives. This makes data contamination much less likely than for generic web text, as these files cannot be meaningfully read or interpreted as plain t...
-
[7]
SimBench’s core task is not to recall a fact but to predict a response distribution for a specific demographic subgroup (e.g., “women in Slovakia”). Even if a model’s training data included thousands of individual survey responses, it would still need to learn, without supervision, how to aggregate these individual points into a coherent distribution for ...
-
[8]
On datasets that are most likely to appear in training data (e.g., US-centric OpinionQA), even the best-performing models achieve an S-score of only 60, far from the 100-point maximum. If models had memorized this benchmark, we would expect scores far closer to perfect. This clear performance ceiling demonstrates that our benchmark is testing a genuine ca...
-
[9]
You are an Amazon Mechanical Turk worker based in the United States
The consistent scaling patterns we observe across diverse datasets suggest genuine simulation capabilities rather than artifacts of training data overlap. Nevertheless, we acknowledge that data contamination remains a fundamental challenge in LLM evaluation, and future work should develop more robust methods to detect and quantify its impact. We include t...
-
[13]
Use this exact valid JSON format: {answer options} and do NOT include anything else
-
[14]
Only output your final answer and nothing else. No explanations or intermediate steps are needed. Replace X with your estimated percentages for each option.
**Answer**:
Prompt for zero-shot CoT:
**Question**: {question}
Estimate what percentage of your group would choose each option. Think step by step about how people with your shared characteristics ...
-
[15]
Use whole numbers from 0 to 100
-
[16]
Ensure the percentages sum to exactly 100
-
[17]
Only include the numbers (no %
-
[18]
Use this exact valid JSON format: {json_format_str}
-
[19]
Replace X with your estimated percentages for each option.
**Answer**:
E. Validation of Elicitation Method
A key methodological choice in SimBench is how to elicit probability distributions from LLMs. For base models, we use direct token probabilities from the first token of the response. For instruction-tuned models, however, two primary methods exist: dir...
-
[20]
A Harmful Indirect Effect (-1.74 points): Instruction tuning significantly reduces model prediction entropy (Path A: β = −0.11, p < .001). In our models, higher entropy is generally associated with better performance (Path B: β = 15.60, p < .001). The indirect effect (A × B) is therefore negative (−1.74), quantifying the performance penalty that instructi...
-
[21]
A Strong, Helpful Direct Effect (+6.46 points): After accounting for the change in entropy, a large positive direct effect remains (β = +6.46, p < .001). This reflects the benefits of instruction tuning that are independent of its impact on output diversity, such as improved instruction following and a better ability to reason about the specified persona. ...
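The arithmetic behind the two effects in items [20] and [21] can be checked with a short sketch of the product-of-coefficients approach. The coefficients are the rounded values quoted in the text. The total effect is not quoted in these excerpts, so the value computed here is only what the reported numbers imply under the usual additive decomposition (total = direct + indirect), which is an assumption of this sketch.

```python
# Rounded coefficients as quoted in the excerpts
path_a = -0.11            # instruction tuning -> prediction entropy
path_b = 15.60            # prediction entropy -> simulation score
reported_indirect = -1.74
reported_direct = 6.46

# Product of the rounded coefficients; small gaps from the reported
# indirect effect would be expected if the original analysis multiplied
# unrounded estimates (an assumption, not stated in the excerpts).
indirect_from_rounded = path_a * path_b

# Implied total effect under the additive decomposition (assumption)
implied_total = reported_direct + reported_indirect
```

The product of the rounded coefficients is about −1.72, close to the reported −1.74, and the reported effects imply a net gain from instruction tuning of roughly 4.72 points.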
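The output rules quoted in the prompt excerpts above (whole numbers from 0 to 100, percentages summing to exactly 100, a fixed JSON format) suggest a validation step like the sketch below. The exact JSON layout behind `{json_format_str}` is not shown in the excerpts, so this sketch assumes a flat object mapping option labels to integers. The function name `parse_percentages` and the option labels are hypothetical.

```python
import json

def parse_percentages(reply, options):
    """Validate a model reply against the quoted rules; return probabilities.

    Assumes the reply is a flat JSON object such as {"A": 60, "B": 40}.
    """
    data = json.loads(reply)  # raises ValueError on invalid JSON
    if not isinstance(data, dict) or set(data) != set(options):
        raise ValueError("reply must be an object with exactly the expected options")
    values = [data[option] for option in options]
    for value in values:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError("each percentage must be a whole number from 0 to 100")
    if sum(values) != 100:
        raise ValueError("percentages must sum to exactly 100")
    return [value / 100 for value in values]

probabilities = parse_percentages('{"A": 60, "B": 30, "C": 10}', ["A", "B", "C"])
```

Rejecting malformed replies rather than silently renormalizing them is a design choice of this sketch; the excerpts do not say how the benchmark handles invalid outputs.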