
arXiv: 2510.17516 · v4 · submitted 2025-10-20 · cs.CL · cs.AI · cs.CY · cs.LG

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Pith reviewed 2026-05-18 06:25 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CY · cs.LG
keywords LLM simulation · human behavior modeling · simulation fidelity · benchmark · alignment tradeoff · model scaling

The pith

Large language models reach only modest fidelity when simulating human behaviors on a new unified benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates SimBench by combining twenty existing datasets on topics from moral choices to economic decisions into one standardized test of how well language models can stand in for real people. It reports that today's strongest models score about forty-one out of one hundred on this measure, with accuracy rising steadily as models grow larger but staying flat when models are given extra thinking steps at test time. The work also finds that instruction tuning helps models match the average answer on questions where people largely agree, yet hurts performance on questions where human answers spread out widely. Models do especially poorly at matching the responses of specific demographic groups. If these patterns hold, the benchmark gives researchers a concrete way to track whether future models become reliable enough to support social-science studies without running new human experiments each time.

Core claim

By unifying twenty datasets that together cover a broad range of human decision tasks, SimBench reveals that current large language models achieve a simulation fidelity of 40.80 out of 100, that this score scales log-linearly with model size but not with added inference-time compute, that instruction tuning creates an alignment-simulation tradeoff favoring consensus questions over diverse ones, and that simulation skill correlates most strongly with performance on knowledge-intensive reasoning benchmarks such as MMLU-Pro.
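The log-linear scaling part of this claim can be made concrete with an ordinary least-squares fit of score against the base-10 logarithm of parameter count. The sketch below is illustrative only: the function name `fit_log_linear` and the three data points are hypothetical and are not values from the paper, which may use a different fitting procedure.

```python
import math


def fit_log_linear(param_counts, scores):
    """Least-squares fit of score = intercept + slope * log10(params).

    Returns (slope, intercept, r_squared). The slope is the number of
    score points gained per tenfold increase in parameter count.
    """
    xs = [math.log10(p) for p in param_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    syy = sum((y - mean_y) ** 2 for y in scores)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Hypothetical (parameter count, score) pairs, NOT results from the paper.
params = [1e9, 1e10, 1e11]
scores = [10.0, 20.0, 30.0]
slope, intercept, r2 = fit_log_linear(params, scores)
print(f"points per 10x parameters: {slope:.2f}, R^2: {r2:.3f}")
```

Reporting the slope and R² alongside the list of model sizes is one way to supply the regression statistics that a scaling claim of this kind usually needs.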

What carries the argument

SimBench, the benchmark formed by standardizing and unifying twenty diverse human-behavior datasets into a single, comparable fidelity score.

If this is right

  • Larger models will continue to improve at human simulation in a predictable way.
  • Giving models more compute at inference time will not raise their simulation scores.
  • Models will remain weaker at reproducing answers from particular demographic groups than from the overall population.
  • Progress on general reasoning benchmarks will continue to predict gains in simulation ability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers could use SimBench scores as a training signal to fine-tune models specifically for higher simulation fidelity.
  • The demographic gaps suggest that current alignment methods may limit how well models can represent varied human populations in social-science applications.
  • If the tradeoff holds, developers might need separate model versions for high-consensus versus high-diversity simulation tasks.

Load-bearing premise

The twenty chosen datasets, once unified, give a valid and representative picture of the full range of human behavioral variation.

What would settle it

Running the same models on a large new collection of human responses drawn from a population study outside the original twenty datasets and obtaining scores far above or below the reported 40.80 level.

Figures

Figures reproduced from arXiv: 2510.17516 by Dirk Hovy, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Paul Röttger, Tiancheng Hu.

Figure 1: SIMBENCH is the first large-scale benchmark to evaluate how well LLMs can simulate group-level human behavior across diverse simulation settings and tasks.
1 INTRODUCTION. Large-scale human experiments and surveys have long been essential tools for informing public policy, commercial decisions, and academic research. Running experiments and surveys, however, is costly and time-consuming. Large language mode…
Figure 2: Model parameter count vs. simulation ability. We measure model size by parameter count and simulation ability by SIMBENCH score S averaged across the two main splits of SIMBENCH.
2) Test-Time Compute. To analyze the effects of increasing test-time compute on LLM simulation ability, we conduct two sets of experiments. We compare the performance of two distinct o4-mini checkpoints (‘low’ vs. ‘high’, which var…
Figure 3: Simulation fidelity by dataset as measured by SIMBENCH score S for each of the 20 datasets in SimBenchPop. We show results for the top five models based on results in …
Figure 4: The alignment-simulation tradeoff: Instruction-tuning helps on consensus questions but hurts on diverse ones. The plot aggregates results from 13 base/instruction-tuned model pairs. Each point represents the average improvement in SIMBENCH Score for one of 25 entropy bins. The point’s x-coordinate is the mean entropy of all questions within that bin. Error bars show the standard error of the mean.
Figure 5 compares the SIMBENCH scores for several instruction-tuned models using both methods. The results are unequivocal: using verbalized distributions (teal dots) dramatically and consistently outperforms direct token probabilities (orange dots) for every instruction-tuned model tested. In many cases, using token probabilities results in scores far below zero, indicating that the model’s raw logits are poorly c…
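A verbalized distribution of the kind discussed here has to be parsed before it can be scored. The prompt fragments quoted in the reference graph ask for valid JSON, whole numbers from 0 to 100, and percentages that sum to exactly 100. The sketch below checks those three constraints. The function name, the assumed reply shape (option label mapped to an integer percentage), and the choice to return `None` on a violation are all assumptions; the source does not show how the authors handle malformed replies.

```python
import json


def parse_verbalized_distribution(raw, options):
    """Turn a JSON reply such as '{"A": 70, "B": 30}' into probabilities.

    Returns a dict mapping each option to a probability, or None when the
    reply is not valid JSON, does not cover exactly the given options,
    contains non-integer or out-of-range values, or does not sum to 100.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(options):
        return None
    values = [data[o] for o in options]
    # bool is a subclass of int in Python, so reject it explicitly.
    if any(isinstance(v, bool) or not isinstance(v, int) for v in values):
        return None
    if any(v < 0 or v > 100 for v in values) or sum(values) != 100:
        return None
    return {o: data[o] / 100 for o in options}


# Hypothetical model reply for a two-option question.
print(parse_verbalized_distribution('{"A": 70, "B": 30}', ["A", "B"]))
```

A stricter or more lenient policy (for example, renormalizing replies that sum to 99 or 101) is a design choice that would need to be stated when reporting scores.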
Figure 6: Model parameter count vs. simulation ability. We measure model size by parameter count and simulation ability by SIMBENCH score S averaged across the two main splits of SIMBENCH.
Figure 7: Response plurality vs. simulation fidelity for base and instruction-tuned models on all questions in SimBenchPop. We measure response plurality by normalized entropy of the human response distribution and simulation fidelity by total variation distance at the question level.
• For base models: The coefficient on the interaction between base models and Human Normalized Entropy is −0.2451 (p < 0.001), indica…
Figure 8: Scatter plots showing the correlation between average …
Figure 9: Effect of Centaur Fine-Tuning. The plots show the binned SIMBENCH Score against normalized human response entropy for Llama-3.1 models at the 70B (left) and 8B (right) scales. Shaded areas represent the 95% confidence interval of the mean score in each bin.
L DATASET DETAILS. We provide details on each of the 20 datasets in SIMBENCH. Note that for many datasets we use only a subset of questions and particip…
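The caption of Figure 7 names two quantities: normalized entropy of the human response distribution and total variation distance at the question level. A minimal sketch of both follows. Dividing Shannon entropy by the logarithm of the number of options is the usual normalization and is assumed here; the captions do not spell out the exact formula, and how the paper turns these distances into the 0 to 100 SIMBENCH score S is not shown in this page, so it is not reproduced.

```python
import math


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over the same options."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def normalized_entropy(p):
    """Shannon entropy divided by log(number of options), so the range is [0, 1].

    Assumption: normalization by log(k); 0 means full consensus and
    1 means responses spread uniformly over all k options.
    """
    k = len(p)
    if k < 2:
        return 0.0
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return entropy / math.log(k)


# Hypothetical human and model distributions for one two-option question.
human = [0.5, 0.5]
model = [0.9, 0.1]
print(total_variation_distance(human, model))
print(normalized_entropy(human))
```

In this toy case the human distribution is maximally diverse, and a model that answers with high confidence is penalized for it, which is the pattern the alignment-simulation tradeoff describes.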
Abstract

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations of simulation fidelity are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that the best LLMs today achieve meaningful but modest simulation fidelity (score: 40.80/100), with performance scaling log-linearly with model size but not with increased inference-time compute. We discover an alignment-simulation tradeoff: instruction tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with knowledge-intensive reasoning (MMLU-Pro, r = 0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
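The alignment-simulation tradeoff in the abstract is visualized in Figure 4 by grouping questions into 25 entropy bins and plotting the mean score improvement per bin with standard-error bars. The sketch below reproduces that summary step under stated assumptions: the function name is hypothetical, bins are formed with equal question counts (the caption does not say whether bins are equal-count or equal-width), and the inputs are per-question entropy and instruction-tuned-minus-base score differences.

```python
import math
import statistics


def bin_by_entropy(entropies, improvements, n_bins=25):
    """Summarize score improvement within equal-count entropy bins.

    Returns a list of (mean entropy, mean improvement, standard error)
    tuples, one per non-empty bin, ordered from low to high entropy.
    """
    pairs = sorted(zip(entropies, improvements))
    n = len(pairs)
    summary = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        xs = [e for e, _ in chunk]
        ys = [d for _, d in chunk]
        # Standard error of the mean; undefined for one point, so use 0.0.
        sem = statistics.stdev(ys) / math.sqrt(len(ys)) if len(ys) > 1 else 0.0
        summary.append((statistics.mean(xs), statistics.mean(ys), sem))
    return summary


# Hypothetical per-question values, NOT data from the paper.
entropies = [0.1, 0.2, 0.8, 0.9]
improvements = [4.0, 6.0, -3.0, -5.0]
for mean_entropy, mean_gain, sem in bin_by_entropy(entropies, improvements, n_bins=2):
    print(f"entropy {mean_entropy:.2f}: gain {mean_gain:+.1f} ± {sem:.1f}")
```

A positive mean in low-entropy bins and a negative mean in high-entropy bins would correspond to the tradeoff the abstract reports.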

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SimBench, a standardized benchmark created by unifying 20 diverse human-behavior datasets spanning moral decision-making, economic choice, and other tasks drawn from a large global participant pool. The central claims are that current best LLMs reach a modest simulation fidelity of 40.80/100, that fidelity scales log-linearly with model size but not with increased inference-time compute, that instruction tuning produces an alignment-simulation tradeoff (improving low-entropy/consensus items while degrading high-entropy/diverse ones), that models struggle to simulate specific demographic groups, and that simulation ability correlates strongly with knowledge-intensive reasoning (MMLU-Pro, r = 0.939).

Significance. If the unified benchmark constitutes a representative ground truth, the work supplies the first large-scale, reproducible instrument for measuring LLM simulation fidelity. The reported scaling relationship, alignment tradeoff, demographic gaps, and correlation with MMLU-Pro would then constitute actionable, falsifiable findings that could guide both model development and the responsible use of LLMs as proxies in the social and behavioral sciences.

major comments (1)
  1. [Abstract] Abstract (paragraph on dataset unification): the claim that the 20 chosen datasets supply a valid and representative ground truth for human behavioral variation is load-bearing for the headline fidelity score (40.80/100), the log-linear scaling result, and the demographic findings, yet the manuscript provides no explicit inclusion criteria, no description of response-format harmonization (e.g., probability binning, Likert rescaling, or handling of varying participant N), and no test for coverage of high- versus low-entropy behaviors. Without these details the reported metrics risk being artifacts of benchmark construction rather than intrinsic properties of the models.
minor comments (2)
  1. [Abstract] The abstract states that performance 'scales log-linearly with model size' but does not indicate which model sizes were included or report the regression statistics; adding these details would strengthen the scaling claim.
  2. [Abstract] Consider reporting the total number of unique participants and a brief diversity metric for the 'large global participant pool' to allow readers to assess coverage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We agree that greater transparency regarding dataset selection and harmonization is essential to substantiate the representativeness of SimBench. We address the concern point by point below and will incorporate the requested clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph on dataset unification): the claim that the 20 chosen datasets supply a valid and representative ground truth for human behavioral variation is load-bearing for the headline fidelity score (40.80/100), the log-linear scaling result, and the demographic findings, yet the manuscript provides no explicit inclusion criteria, no description of response-format harmonization (e.g., probability binning, Likert rescaling, or handling of varying participant N), and no test for coverage of high- versus low-entropy behaviors. Without these details the reported metrics risk being artifacts of benchmark construction rather than intrinsic properties of the models.

    Authors: We thank the referee for this important observation. The full manuscript contains a 'Data Sources' section describing the 20 datasets, but we acknowledge that explicit inclusion criteria, harmonization procedures, and entropy coverage were not stated with sufficient precision. In the revised version we will add a dedicated subsection 'Benchmark Construction and Validation' that specifies: (1) inclusion criteria (public availability, minimum 500 participants, coverage of at least five distinct behavioral domains, and global participant diversity); (2) harmonization details (conversion of all responses to a common [0,1] probability scale, Likert rescaling via min-max normalization, and participant-weighted averaging to accommodate varying N); and (3) an entropy analysis, including a new supplementary figure that reports the distribution of item-level entropy across the benchmark and confirms balanced coverage of high-entropy (diverse) and low-entropy (consensus) behaviors. These additions will make the ground-truth claim more robust without altering the reported fidelity scores or scaling relationships. revision: yes
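    The two harmonization steps named in this simulated response, min-max rescaling of Likert items and participant-weighted averaging, can be sketched as follows. Function names and example numbers are hypothetical, and because the rebuttal itself is simulated, this should not be read as the paper's actual pipeline.

```python
def min_max_rescale(value, low, high):
    """Map a response on a [low, high] scale (e.g. a 1-5 Likert item) to [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (value - low) / (high - low)


def participant_weighted_mean(values, participant_counts):
    """Average group-level values, weighting each by its number of participants."""
    total = sum(participant_counts)
    if total <= 0:
        raise ValueError("participant counts must sum to a positive number")
    return sum(v * n for v, n in zip(values, participant_counts)) / total


# Hypothetical example: a rating of 4 on a 1-5 scale, and two groups of
# 100 and 300 participants with rescaled means 0.2 and 0.8.
print(min_max_rescale(4, 1, 5))
print(participant_weighted_mean([0.2, 0.8], [100, 300]))
```

    Weighting by participant count keeps a small group from pulling the pooled value as much as a large one, which matters when the datasets differ widely in N.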

Circularity Check

0 steps flagged

Empirical benchmark construction against external datasets exhibits no circularity

full rationale

The paper introduces SimBench as a unification of 20 independent human behavior datasets drawn from external sources and reports direct empirical measurements of LLM simulation fidelity (e.g., 40.80/100 score, log-linear scaling with model size, alignment-simulation tradeoff, and MMLU-Pro correlation). These results are computed by comparing model outputs to the held-out human data distributions rather than being derived from any fitted parameters, self-referential equations, or ansatzes internal to the paper. No derivation chain reduces to its own inputs by construction, and the central claims rest on observable performance against independent ground truth rather than self-citation load-bearing or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that the selected datasets constitute a sufficient proxy for human behavior distributions and that the chosen fidelity metric meaningfully captures simulation quality.

axioms (1)
  • domain assumption The 20 datasets together cover a representative sample of human behavioral variation across moral, economic, and other domains.
    Invoked when claiming the benchmark enables fundamental questions about LLM simulation success and failure.

pith-pipeline@v0.9.0 · 5780 in / 1274 out tokens · 58730 ms · 2026-05-18T06:25:02.187332+00:00 · methodology



Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

    cs.CL 2026-05 unverdicted novelty 7.0

    ConsumerSimBench evaluates 13 LLMs on reconstructing crowd reactions from 1,553 Chinese social-media topics using 23,122 auditable yes-no criteria, finding maximum coverage of 47.8% by Gemini-3.1-Pro.

  2. Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces OmniBehavior benchmark from real-world data and shows LLMs exhibit hyper-activity, persona homogenization, and utopian bias in behavior simulation.

  3. Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.

  4. PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior

    cs.CR 2026-05 unverdicted novelty 6.0

    PrivacySIM shows that conditioning LLMs on user personas like demographics and attitudes improves simulation of privacy choices but reaches only 40.4% accuracy against real responses from 1,000 users.

  5. LLM-Based Educational Simulation: Evaluating Temporal Student Persona Stability Across ADHD Profiles

    cs.HC 2026-05 unverdicted novelty 5.0

    LLM-simulated ADHD student personas show stable self-reported traits but behavioral drift in unscripted interactions that explicit task prompts fully eliminate.

  6. LLM-Based Educational Simulation: Evaluating Temporal Student Persona Stability Across ADHD Profiles

    cs.HC 2026-05 unverdicted novelty 5.0

    LLM student personas with ADHD show stable self-reported traits at high intensity but behavioral drift in unscripted interactions that scripted prompts eliminate.

  7. The Silicon Society Cookbook: Design Space of LLM-based Social Simulations

    cs.MA 2026-04 unverdicted novelty 5.0

    The base LLM choice dominates simulation outcomes in LLM-based social networks, while other design parameters show either additive or complex interactive effects.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 5 Pith papers · 1 internal anchor

  1. [1]

    DeepSeek-V3 Technical Report

    URL https://aclanthology.org/2023.emnlp-main.669/. Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: an open platform for evaluating LLMs by human preference. In Proceedings of the 41st International Conference on Machine...

  2. [2]

    Yijiang River Dong, Tiancheng Hu, and Nigel Collier

    URL https://proceedings.neurips.cc/paper_files/paper/2024/file/515c62809e0a29729d7eec26e2916fc0-Paper-Conference.pdf. Yijiang River Dong, Tiancheng Hu, and Nigel Collier. Can LLM be a personalized judge? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10126–10141, Miam...

  3. [3]

    Can LLM be a Personalized Judge?

    Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.592. URL https://aclanthology.org/2024.findings-emnlp.592/. Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, and Nigel Collier. When personalization meets reality: A multi-faceted analysis of personalized preference learning. arXiv preprint arXiv:2502.19158, 2025. E...

  4. [4]

    doi: 10.1038/s41586-025-09422-z

    URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/89e44582fd28ddfea1ea4dcb0ebbf4b0-Abstract-Datasets_and_Benchmarks.html. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Li...

  5. [5]

    wisdom of crowds

    GESIS, Cologne. ZA8000 Data file Version 2.0.0, https://doi.org/10.4232/5.ZA8000.2.0.0, 2024. Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. How can we know when language models know? On the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977, 2021. doi: 10.1162/tacl_a_0...

  6. [6]

    Virtually all of our source data are primarily distributed as structured data files in specialized formats like R or SAS in academic archives

    Much of our source data is not easily ingested by standard web scrapes. Virtually all of our source data are primarily distributed as structured data files in specialized formats like R or SAS in academic archives. This makes data contamination much less likely than for generic web text, as these files cannot be meaningfully read or interpreted as plain t...

  7. [7]

    women in Slovakia

    SimBench’s core task is not to recall a fact but to predict a response distribution for a specific demographic subgroup (e.g., “women in Slovakia”). Even if a model’s training data included thousands of individual survey responses, it would still need to learn, without supervision, how to aggregate these individual points into a coherent distribution for ...

  8. [8]

    If models had memorized this benchmark, we would expect scores far closer to perfect

    On datasets that are most likely to appear in training data (e.g., US-centric OpinionQA), even the best-performing models achieve an S-score of only 60, far from the 100-point maximum. If models had memorized this benchmark, we would expect scores far closer to perfect. This clear performance ceiling demonstrates that our benchmark is testing a genuine ca...

  9. [9]

    You are an Amazon Mechanical Turk worker based in the United States

    The consistent scaling patterns we observe across diverse datasets suggest genuine simulation capabilities rather than artifacts of training data overlap. Nevertheless, we acknowledge that data contamination remains a fundamental challenge in LLM evaluation, and future work should develop more robust methods to detect and quantify its impact. We include t...

  10. [13]

    Use this exact valid JSON format: {answer options} and do NOT include anything else

  11. [14]

    No explanations or intermediate steps are needed. Replace X with your estimated percentages for each option

    Only output your final answer and nothing else. No explanations or intermediate steps are needed. Replace X with your estimated percentages for each option. **Answer**: Prompt for zero-shot CoT: **Question**: {question} Estimate what percentage of your group would choose each option. Think step by step about how people with your shared characteristics ...

  12. [15]

    Use whole numbers from 0 to 100

  13. [16]

    Ensure the percentages sum to exactly 100

  14. [17]

    Only include the numbers (no %

  15. [18]

    Use this exact valid JSON format: {json_format_str}

  16. [19]

    verbalized

    Replace X with your estimated percentages for each option. **Answer**: E VALIDATION OF ELICITATION METHOD. A key methodological choice in SIMBENCH is how to elicit probability distributions from LLMs. For base models, we use direct token probabilities from the first token of the response. For instruction-tuned models, however, two primary methods exist: dir...

  17. [20]

    In our models, higher entropy is generally associated with better performance (Path B: β = 15.60, p < .001)

    A Harmful Indirect Effect (−1.74 points): Instruction tuning significantly reduces model prediction entropy (Path A: β = −0.11, p < .001). In our models, higher entropy is generally associated with better performance (Path B: β = 15.60, p < .001). The indirect effect (A × B) is therefore negative (−1.74), quantifying the performance penalty that instructi...

  18. [21]

    distribution-preserving

    A Strong, Helpful Direct Effect (+6.46 points): After accounting for the change in entropy, a large positive direct effect remains (β = +6.46, p < .001). This reflects the benefits of instruction tuning that are independent of its impact on output diversity, such as improved instruction following and a better ability to reason about the specified persona. ...
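The mediation figures quoted in the last two entries can be checked with simple arithmetic. The sketch below multiplies the two path coefficients to recover the indirect effect and adds the direct effect to obtain an implied total. Two caveats apply: the product of the rounded coefficients does not exactly equal the reported −1.74, presumably because the published coefficients are rounded (an assumption), and the total is derived here using the standard decomposition total = direct + indirect rather than quoted from the source.

```python
# Coefficients as quoted in the excerpts above.
path_a = -0.11             # instruction tuning -> model prediction entropy
path_b = 15.60             # model prediction entropy -> score
direct_effect = 6.46       # instruction tuning -> score, entropy held fixed
reported_indirect = -1.74  # indirect effect as reported (A x B)

# Indirect effect recomputed from the rounded path coefficients.
indirect_from_rounded = path_a * path_b

# Implied total effect: derived here, not a number quoted in the source.
implied_total = direct_effect + reported_indirect

print(f"A x B from rounded coefficients: {indirect_from_rounded:.3f}")
print(f"implied total effect: {implied_total:.2f}")
```

On these numbers the direct benefit of instruction tuning outweighs the entropy-related penalty on average, which is consistent with the tradeoff appearing only on high-entropy questions.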