Recognition: no theorem link
Post-training makes large language models less human-like
Pith reviewed 2026-05-11 02:43 UTC · model grok-4.3
The pith
Post-training reduces how closely large language models match human behavior in experiments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the Psych-201 dataset to measure behavioral alignment at scale, we find that post-training consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction does not improve predictions at the level of individuals.
What carries the argument
The Psych-201 dataset, which enables large-scale comparison of model outputs to human behavioral data from psychological tasks and exposes the impact of post-training.
If this is right
- Base models align more closely with human behavior than their post-trained versions across tested tasks.
- The gap between model predictions and human choices grows larger with each new generation of models.
- Conditioning models on individual participant information does not enhance accuracy for predicting specific people's responses.
- The same processes that improve LLMs as general-purpose assistants decrease their value as stand-ins for human participants.
Where Pith is reading between the lines
- Researchers simulating human subjects with LLMs may obtain more accurate results by starting from base models rather than post-trained ones.
- Model development faces an implicit trade-off between everyday usefulness and fidelity to observed human decision patterns.
- New benchmarks focused on behavioral alignment could help track whether future training methods reverse the observed trend.
Load-bearing premise
The tasks and metrics in the Psych-201 dataset provide a neutral and sufficiently complete picture of human behavior, and the alignment measure itself is not confounded by post-training artifacts such as refusal or helpfulness biases.
What would settle it
Replicating the measurements on a fresh set of human behavioral experiments collected independently of any model training data, where post-trained models show equal or higher alignment than base models.
read the original abstract
Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Psych-201, a new dataset for measuring behavioral alignment between LLMs and humans at scale. It empirically demonstrates that post-training consistently reduces alignment with human behavior across model families, sizes, and objectives. The misalignment is reported to widen in newer model generations even as base models improve. The authors additionally find that persona-induction techniques fail to improve predictions at the individual level and conclude that processes turning LLMs into useful assistants also make them poorer models of human behavior.
Significance. If substantiated with rigorous methods, the result would be significant for researchers using LLMs as human surrogates in behavioral science. It identifies a systematic trade-off between post-training for utility and fidelity to human behavior, which could affect model selection for psychological experiments and motivate training approaches that preserve alignment. The scalable Psych-201 benchmark itself represents a potential contribution if its independence from post-training artifacts is established.
major comments (2)
- [Methods] Methods section on Psych-201 construction: The central claim depends on Psych-201 supplying an independent proxy for human behavior. The paper must explicitly detail task selection, response formats, scoring rules, and any safeguards against embedding post-training biases (e.g., refusal patterns, safety alignments, or helpfulness objectives). Without this, the observed drop in alignment after post-training risks being circular rather than diagnostic, as noted in the stress-test concern.
- [Results] Results on generational trends: The claim that misalignment widens in newer generations requires quantitative support including exact model lists, alignment metric definitions, sample sizes per condition, statistical tests, and error bars. The abstract states the pattern but the full results section must show these controls to substantiate the widening gap across base vs. post-trained models.
minor comments (2)
- [Abstract] Abstract: The statement that post-training reduces alignment 'across model families, sizes, and objectives' should name the specific models and objectives examined to allow immediate assessment of scope.
- [Figures] Figures: All plots of alignment scores should include error bars, legends, and annotations for statistical significance to improve clarity and reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments have prompted us to clarify key aspects of Psych-201 and strengthen the quantitative presentation of generational trends. We address each major comment below and have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [Methods] Methods section on Psych-201 construction: The central claim depends on Psych-201 supplying an independent proxy for human behavior. The paper must explicitly detail task selection, response formats, scoring rules, and any safeguards against embedding post-training biases (e.g., refusal patterns, safety alignments, or helpfulness objectives). Without this, the observed drop in alignment after post-training risks being circular rather than diagnostic, as noted in the stress-test concern.
Authors: We agree that explicit documentation of Psych-201's construction is required to establish its independence as a human-behavior proxy. In the revised manuscript we have expanded the Methods section with: (i) the precise criteria and sources used for task selection (drawing exclusively from pre-2023 peer-reviewed psychological studies), (ii) standardized response formats and scoring rules (normalized distributional agreement scores), and (iii) explicit safeguards including exclusion of safety-critical items, pre-testing for refusal patterns on base models, and additional stress-test analyses demonstrating that the post-training alignment drop persists after controlling for helpfulness and refusal artifacts. These additions directly address the circularity concern. revision: yes
-
Referee: [Results] Results on generational trends: The claim that misalignment widens in newer generations requires quantitative support including exact model lists, alignment metric definitions, sample sizes per condition, statistical tests, and error bars. The abstract states the pattern but the full results section must show these controls to substantiate the widening gap across base vs. post-trained models.
Authors: The original submission already contains the requested quantitative elements in the Results section and supplementary materials (exact model pairs, Pearson-correlation alignment metric, N=201 tasks with per-task human sample sizes, and ANOVA tests for generational effects). To improve clarity and address the referee's request, we have added error bars to all relevant figures, inserted a summary table listing all models, metrics, and test statistics, and explicitly contrasted base versus post-trained performance within each generation. These revisions make the widening misalignment fully transparent and statistically supported. revision: partial
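The exchange above hinges on how a per-task alignment score and the base-versus-post-trained gap are computed. Below is a minimal sketch, assuming the Pearson-correlation metric the rebuttal mentions, applied to human choice frequencies versus model choice probabilities; the function names, toy numbers, and the one-sample t-test standing in for the paper's ANOVA are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch (not the authors' code): one plausible way to score per-task
# behavioral alignment and the post-training misalignment gap discussed above.
import numpy as np
from scipy import stats

def alignment_score(human_freqs: np.ndarray, model_probs: np.ndarray) -> float:
    """Pearson correlation between human choice frequencies and model choice
    probabilities over the response options of a single task (assumed metric)."""
    r, _ = stats.pearsonr(human_freqs, model_probs)
    return r

def misalignment_gap(human, base_model, post_model) -> float:
    """Per-task alignment difference: base-model alignment minus post-trained
    alignment. Positive values mean post-training reduced alignment."""
    return alignment_score(human, base_model) - alignment_score(human, post_model)

# Toy example: three tasks with four response options each (made-up numbers).
human = [np.array([0.5, 0.3, 0.1, 0.1]),
         np.array([0.2, 0.2, 0.4, 0.2]),
         np.array([0.7, 0.1, 0.1, 0.1])]
base  = [np.array([0.45, 0.35, 0.1, 0.1]),
         np.array([0.25, 0.15, 0.45, 0.15]),
         np.array([0.6, 0.2, 0.1, 0.1])]
post  = [np.array([0.8, 0.1, 0.05, 0.05]),
         np.array([0.1, 0.1, 0.7, 0.1]),
         np.array([0.9, 0.05, 0.03, 0.02])]

gaps = [misalignment_gap(h, b, p) for h, b, p in zip(human, base, post)]

# A paired test over tasks (a one-sample t-test here, as a stand-in for the
# paper's ANOVA) asks whether the gap is reliably different from zero.
t, p_value = stats.ttest_1samp(gaps, popmean=0.0)
print(f"mean gap = {np.mean(gaps):.3f}, t = {t:.2f}, p = {p_value:.3f}")
```

A reliably positive mean gap across the full task set would reproduce the direction of the reported effect; the real analysis would use all Psych-201 tasks with per-condition human sample sizes rather than these toy values.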
Circularity Check
No significant circularity; purely empirical comparison
full rationale
The paper introduces the Psych-201 dataset and performs direct empirical measurements of behavioral alignment between human data and outputs from base versus post-trained LLMs across families and sizes. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or description. The central claim rests on observed differences in alignment scores rather than any chain that reduces by construction to its own inputs or to load-bearing self-citations. This is a standard empirical study whose results are falsifiable against external human data and independent model evaluations.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Post-training consistently reduces human-likeness. This effect holds across model families and applies to all post-training objectives, including instruction-tuning, reasoning, and vision.
- [2] Base models continue to improve across generations, i.e., newer models are generally more aligned.
- [3] Post-training misalignment – defined as the alignment difference between a base model and its post-trained counterpart – widens in newer models.
- [4] The largest post-training misalignment occurs in the domains of psycholinguistics and reasoning.
- [5] Persona-induction, a popular technique for eliciting more human-like behavior by conditioning models on participant-specific information, does not improve predictions at the level of individuals. Taken together, our findings have important implications for using LLMs as behavioral surrogates. Most prior studies rely on post-trained models, given their...
- [6] R. Bommasani, et al., On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
- [7] M. Binz, et al., How should the advancement of large language models affect the practice of science? Proceedings of the National Academy of Sciences 122(5), e2401227121 (2025)
- [8] J. S. Park, et al., Generative agents: Interactive simulacra of human behavior, in Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (2023), pp. 1–22
- [9] J. S. Park, et al., Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109 (2024)
- [10] M. Binz, et al., A foundation model to predict and capture human cognition. Nature 644(8078), 1002–1009 (2025)
- [11] Z. Cui, N. Li, H. Zhou, A large-scale replication of scenario-based experiments in psychology and management using large language models. Nature Computational Science 5(8), 627–634 (2025)
- [12] J. Hullman, D. Broska, H. Sun, A. Shaw, This human study did not involve human subjects: Validating LLM simulations as behavioral evidence. arXiv preprint arXiv:2602.15785 (2026)
- [13] G. V. Aher, R. I. Arriaga, A. T. Kalai, Using large language models to simulate multiple humans and replicate human subject studies, in International Conference on Machine Learning (PMLR) (2023), pp. 337–371
- [14] R. Marjieh, I. Sucholutsky, P. van Rijn, N. Jacoby, T. L. Griffiths, Large language models predict human sensory judgments across six modalities. Scientific Reports 14(1), 21445 (2024)
- [15] J. Hu, K. Mahowald, G. Lupyan, A. Ivanova, R. Levy, Language models align with human judgments on key grammatical constructions. Proceedings of the National Academy of Sciences 121(36), e2400917121 (2024)
- [16] A. K. Lampinen, et al., Language models, like humans, show content effects on reasoning tasks. PNAS Nexus 3(7), pgae233 (2024)
- [17] M. Binz, E. Schulz, Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences 120(6), e2218523120 (2023)
- [18] T. Hagendorff, S. Fabi, M. Kosinski, Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nature Computational Science 3(10), 833–838 (2023)
- [19] Y. Chen, T. X. Liu, Y. Shan, S. Zhong, The emergence of economic rationality of GPT. Proceedings of the National Academy of Sciences 120(51), e2316205120 (2023)
- [20] Y. Gao, D. Lee, G. Burtch, S. Fazelpour, Take caution in using LLMs as human surrogates. Proceedings of the National Academy of Sciences 122(24), e2501660122 (2025)
- [21] L. Kastrati, et al., Agreement between mega-trials and smaller trials: a systematic review and meta-research analysis. JAMA Network Open 7(9), e2432296 (2024)
- [22] B. Warner, et al., Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference, in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2025), pp. 2526–2547
- [23] A. Yang, et al., Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
- [24] A. Grattafiori, et al., The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
- [25] T. Olmo, et al., Olmo 3. arXiv preprint arXiv:2512.13961 (2025)
- [26] R. Kirk, et al., Understanding the effects of RLHF on LLM generalisation and diversity. arXiv preprint arXiv:2310.06452 (2023)
- [27] D. Linsley, et al., Performance-optimized deep neural networks are evolving into worse models of inferotemporal visual cortex. Advances in Neural Information Processing Systems 36, 28873–28891 (2023)
- [28] B.-D. Oh, W. Schuler, Why does surprisal from larger transformer-based language models provide a poorer fit to human reading times? Transactions of the Association for Computational Linguistics 11, 336–350 (2023)
- [29]
- [30] K. M. Collins, et al., Evaluating Language Models' Evaluations of Games. arXiv preprint arXiv:2510.10930 (2025)
- [31] G. Gigerenzer, W. Gaissmaier, Heuristic decision making. Annual Review of Psychology 62, 451–482 (2011)
- [32] M. Binz, S. J. Gershman, E. Schulz, D. Endres, Heuristics from bounded meta-learned inference. Psychological Review 129(5), 1042 (2022)
- [33] L. Salewski, S. Alaniz, I. Rio-Torto, E. Schulz, Z. Akata, In-context impersonation reveals large language models' strengths and biases. Advances in Neural Information Processing Systems 36, 72044–72057 (2023)
- [34] S. Wu, et al., HumanLM: Simulating Users with State Alignment Beats Response Imitation. arXiv preprint arXiv:2603.03303 (2026)
- [35] D. Paglieri, L. Cross, W. A. Cunningham, J. Z. Leibo, A. S. Vezhnevets, Persona Generators: Generating Diverse Synthetic Personas at Scale. arXiv preprint arXiv:2602.03545 (2026)
- [36] Y. Ma, et al., Synthetic Interaction Data for Scalable Personalization in Large Language Models. arXiv preprint arXiv:2602.12394 (2026)
- [37] L. P. Argyle, et al., Out of one, many: Using language models to simulate human samples. Political Analysis 31(3), 337–351 (2023)
- [38]
- [39]
- [40] E. Shapira, M. Tennenholtz, R. Reichart, Alignment Makes Language Models Normative, Not Descriptive. arXiv preprint arXiv:2603.17218 (2026)
- [41] A. Reinhart, et al., Do LLMs write like humans? Variation in grammatical and rhetorical styles. Proceedings of the National Academy of Sciences 122(8), e2422455122 (2025)
- [42] T. Kuribayashi, Y. Oseki, T. Baldwin, Psychometric predictive power of large language models, in Findings of the Association for Computational Linguistics: NAACL 2024 (2024), pp. 1983–2005
- [43] L. Ouyang, et al., Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
- [44] Y. Lin, et al., Mitigating the alignment tax of RLHF, in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (2024), pp. 580–606
- [45] Excerpt from the paper's supplementary materials: ...were reserved as a held-out test set. The submitted datasets underwent a lightweight review process to ensure that they did not contain obvious formatting or implementation bugs. Large language models: We conducted evaluations using three major families on Psych-201: Qwen3, a state-of-the-art open-source model family; Llama3.X, a model family with a broad e...
discussion (0)