HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
Pith reviewed 2026-05-16 14:30 UTC · model grok-4.3
The pith
Modeling human cognitive patterns as interacting forces lets an 8B LLM outperform a 32B model on realistic multi-pattern role-play.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Treating psychological patterns as interacting causal forces, the authors extract 244 patterns from roughly 12,000 academic papers and synthesize 11,359 scenarios in which 2–5 patterns interact. Dual-level checklists measure both individual pattern fidelity and emergent multi-pattern dynamics, achieving r=0.90 alignment with human judgments while exposing that holistic metrics conflate simulation accuracy with social desirability. HumanLLM-8B, trained on this data, outperforms Qwen3-32B on multi-pattern dynamics with a quarter of the parameters, which the authors take as evidence that authentic anthropomorphism requires cognitive modeling of the processes behind human behaviors.
What carries the argument
Dual-level checklists that separately evaluate fidelity to individual psychological patterns and to their emergent interactions in multi-turn scenarios expressing inner thoughts, actions, and dialogue.
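To make the dual-level idea concrete, here is a minimal sketch of how checklist answers could be aggregated into one score per level. It is not the authors' implementation: the `ChecklistItem` type, the `dual_level_scores` function, the 0-100 scale computed as the share of satisfied items, and all questions except the first are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ChecklistItem:
    level: str     # "pattern" (individual fidelity) or "dynamics" (multi-pattern interaction)
    question: str  # value-neutral yes/no question
    passed: bool   # judge's answer for one model response


def dual_level_scores(items: List[ChecklistItem]) -> Dict[str, float]:
    """Return the percentage of satisfied items per level (assumed 0-100 scale)."""
    totals: Dict[str, int] = {}
    passes: Dict[str, int] = {}
    for item in items:
        totals[item.level] = totals.get(item.level, 0) + 1
        passes[item.level] = passes.get(item.level, 0) + int(item.passed)
    return {level: 100.0 * passes[level] / totals[level] for level in totals}


items = [
    # First question is quoted on this page; the remaining three are hypothetical.
    ChecklistItem("pattern", "Does Nouman attribute failure to external factors?", True),
    ChecklistItem("pattern", "Does Nouman describe the other team's concerns in dispositional terms?", True),
    ChecklistItem("dynamics", "Does one pattern visibly amplify or dampen another?", False),
    ChecklistItem("dynamics", "Does the conflict between patterns shape the final decision?", True),
]

scores = dual_level_scores(items)
print(scores)
```

Keeping the two levels as separate scores, rather than averaging them, is what lets a response score well on individual patterns while still failing on their interaction.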
If this is right
- Standard holistic metrics for role-playing agents reward socially desirable responses over accurate simulation of cognitive processes.
- Explicit training on pattern-interaction data produces larger gains in anthropomorphism than scaling model size alone.
- Scenarios built from conflicting or modulating patterns expose limitations hidden by single-pattern or surface-level tests.
- Process-level cognitive modeling transfers to better handling of complex behavioral dynamics with fewer parameters.
Where Pith is reading between the lines
- The same pattern-extraction method could be applied to non-academic sources to test cultural generalizability of the benchmark.
- Applications such as therapeutic chatbots or educational agents would benefit from prioritizing process fidelity over surface realism.
- Future benchmarks could add real-time human interaction loops to validate whether LLM performance holds when patterns unfold without pre-synthesis.
Load-bearing premise
The 244 patterns drawn from academic papers and the 11,359 synthesized scenarios faithfully represent real human cognitive interactions without meaningful selection or synthesis bias.
What would settle it
Record actual humans placed in situations that match the paper's pattern combinations and compare their inner thoughts, actions, and dialogue against the outputs of the trained LLMs.
Original abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HumanLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from $\sim$12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment ($r=0.90$) while revealing that holistic metrics conflate simulation accuracy with social desirability. HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4$\times$ fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling -- simulating not just what humans do, but the psychological processes generating those behaviors. Our dataset, code, and model are available at: https://github.com/YJGoodbye2024/HumanLLM
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HumanLLM, a framework for benchmarking LLM anthropomorphism by extracting 244 psychological patterns from ~12,000 academic papers, synthesizing 11,359 multi-pattern interaction scenarios (2-5 patterns per scenario), and applying dual-level checklists to measure individual pattern fidelity and emergent dynamics. It reports r=0.90 alignment with human raters and claims that a fine-tuned HumanLLM-8B model outperforms Qwen3-32B on multi-pattern tasks despite 4x fewer parameters, arguing that authentic anthropomorphism requires explicit cognitive process modeling rather than behavioral mimicry alone.
Significance. If the benchmark construction and human alignment hold under scrutiny, the work offers a structured, reproducible resource for evaluating psychological fidelity in LLMs and role-playing agents, with the open release of dataset, code, and model strengthening its potential impact on human-AI interaction research.
major comments (4)
- [Abstract] Abstract: The reported r=0.90 human alignment is presented without any information on the number of raters, inter-rater reliability statistics (e.g., Cohen's kappa or intraclass correlation), rater instructions, or controls for rater bias, which directly undermines the strength of the central alignment claim.
- [Pattern extraction and scenario synthesis] Pattern extraction and scenario synthesis sections: The process for deriving 244 patterns from ~12,000 papers and synthesizing 11,359 scenarios lacks explicit criteria for pattern selection, validation against independent human behavioral data, or controls for selection/synthesis bias (e.g., over-representation of Western/clinical constructs), which is load-bearing for the claim that the benchmark faithfully represents real cognitive interactions.
- [Results] Results on model comparison: The claim that HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics requires details on the fine-tuning procedure, training data composition, hyperparameter settings, and any controls for parameter count or baseline differences; without these, the size-vs-performance result cannot be interpreted as evidence for cognitive modeling.
- [Evaluation methodology] Evaluation methodology: The dual-level checklists are described at a high level but lack concrete examples of checklist items, scoring rubrics, or how emergent dynamics are distinguished from individual pattern fidelity, making it impossible to assess whether holistic metrics truly conflate simulation accuracy with social desirability as asserted.
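The first major comment asks for inter-rater reliability statistics such as Cohen's kappa. As a point of reference, the sketch below computes Cohen's kappa for two raters from its standard definition (observed agreement corrected for chance agreement). The function name and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical yes/no checklist answers from two raters on ten items.
rater_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

In this toy example the raters agree on 8 of 10 items, but kappa is lower than 0.8 because some of that agreement is expected by chance.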
minor comments (2)
- [Abstract] The GitHub link in the abstract should include a direct pointer to the exact dataset version and checklist templates used in the reported experiments.
- [Abstract] Notation for the correlation coefficient (r=0.90) should specify whether it is Pearson's r and include the associated p-value or confidence interval.
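The second minor comment asks for a confidence interval alongside r=0.90. One common way to obtain it is the Fisher z-transformation, sketched below with the standard library only. The sample size used in the example (n=50) is an assumption for illustration; this page does not state how many items or raters underlie the reported correlation.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r: float, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Approximate confidence interval for Pearson's r via the Fisher z-transformation."""
    if n < 4:
        raise ValueError("need at least 4 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                     # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)           # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# n = 50 is a hypothetical sample size, not a figure from the paper.
low, high = fisher_ci(0.90, 50)
print(f"95% CI for r=0.90 with n=50: [{low:.3f}, {high:.3f}]")
```

The width of the interval depends strongly on n, which is why the referee's request for the sample size and interval matters for interpreting the alignment claim.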
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below. Revisions have been made to improve transparency and completeness where the concerns are valid.
Point-by-point responses
-
Referee: [Abstract] Abstract: The reported r=0.90 human alignment is presented without any information on the number of raters, inter-rater reliability statistics (e.g., Cohen's kappa or intraclass correlation), rater instructions, or controls for rater bias, which directly undermines the strength of the central alignment claim.
Authors: We agree that the abstract should be self-contained on this central claim. The full protocol—including the number of raters, inter-rater reliability (Cohen’s kappa and ICC), rater instructions, and bias controls—is described in Section 4.2. We have revised the abstract to include a concise summary of these elements so readers can immediately assess the alignment evidence. revision: yes
-
Referee: [Pattern extraction and scenario synthesis] Pattern extraction and scenario synthesis sections: The process for deriving 244 patterns from ~12,000 papers and synthesizing 11,359 scenarios lacks explicit criteria for pattern selection, validation against independent human behavioral data, or controls for selection/synthesis bias (e.g., over-representation of Western/clinical constructs), which is load-bearing for the claim that the benchmark faithfully represents real cognitive interactions.
Authors: We acknowledge the need for greater explicitness. Section 3.1 outlines a systematic literature review, but we have added a new subsection with precise inclusion/exclusion criteria, cross-validation against independent behavioral datasets, and explicit discussion of source diversity and mitigation of Western/clinical bias. These additions directly support the claim of faithful representation. revision: yes
-
Referee: [Results] Results on model comparison: The claim that HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics requires details on the fine-tuning procedure, training data composition, hyperparameter settings, and any controls for parameter count or baseline differences; without these, the size-vs-performance result cannot be interpreted as evidence for cognitive modeling.
Authors: We agree additional detail is required for interpretability. Appendix B already contains the fine-tuning procedure, training data composition (the 11,359 scenarios), hyperparameters, and LoRA settings. We have expanded the main results section with a summary of these details plus parameter-matched baseline comparisons, allowing readers to evaluate whether the gains stem from cognitive modeling. revision: yes
-
Referee: [Evaluation methodology] Evaluation methodology: The dual-level checklists are described at a high level but lack concrete examples of checklist items, scoring rubrics, or how emergent dynamics are distinguished from individual pattern fidelity, making it impossible to assess whether holistic metrics truly conflate simulation accuracy with social desirability as asserted.
Authors: We have added concrete examples of individual-pattern and emergent-dynamics checklist items, the full scoring rubrics, and a new paragraph clarifying how interaction effects are isolated from summed individual scores. These revisions appear in the revised Section 4 and Appendix C, directly addressing the distinction and the potential conflation with social desirability. revision: yes
Circularity Check
No significant circularity; benchmark and claims derived from external literature
Full rationale
The paper extracts 244 patterns from ~12,000 external academic papers and synthesizes 11,359 scenarios from them, then evaluates via human raters (r=0.90) and compares HumanLLM-8B against Qwen3-32B on the resulting benchmark. No equations, fitted parameters renamed as predictions, or self-citation chains reduce any reported result to the authors' own inputs by construction. The central claim rests on external literature and independent human validation rather than tautological self-definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Psychological patterns identified in academic literature can be treated as interacting causal forces that generate observable human behavior.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We construct 244 patterns from ∼12,000 academic papers and synthesize 11,359 scenarios where 2–5 patterns reinforce, conflict, or modulate each other
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Large language models show amplified cognitive biases in moral decision-making. Proceedings of the National Academy of Sciences, 122(25):e2412015122. Robert B Cialdini and 1 others. 2009. Influence: Science and practice, volume 4. Pearson education Boston. Lee J Cronbach and Paul E Meehl. 1955. Construct validity in psychological tests. Psychological b...
-
[2]
Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu
Role play with large language models. Nature, 623(7987):493–498. Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu
-
[3]
Character-llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158, 2023
Character-LLM: A Trainable Agent for Role-Playing. Preprint, arXiv:2310.10158. Amos Tversky and Daniel Kahneman. 1974. Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. Science, 185(4157):1124–1131. Tomer Ullman. 2023. Large language models fail on trivial alterations to theory-of-m...
-
[4]
The Rise of AI Companions: Interaction with AI Companions and Psychological Well-being
Evaluating character understanding of large language models via character profiling from fictional works. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8015–8036, Miami, Florida, USA. Association for Computational Linguistics. Yutong Zhang, Dora Zhao, Jeffrey T. Hancock, Robert Kraut, and Diyi Yang. 2025. ...
-
[5]
Pattern-Accurate Behavior: Nouman’s dialogue precisely instantiates the ultimate attribution error, framing engineering’s shortcomings as unavoidable situational constraints (“innovative interface,” “timeline pressure”) while characterizing marketing’s concerns in dispositional terms (“subjective,” “panic,” “catastrophizing”)
-
[6]
contradicts his core identity as a data-driven engineer
LLM Misinterpretation: The holistic judge accurately detects the defensive, dismissive behavior but misinterprets it as a quality defect. It penalizes the model for generating an unlikable character despite the instruction to exhibit attribution bias. The criticism “contradicts his core identity as a data-driven engineer” reveals the judge’s failure to...
-
[7]
Does Nouman attribute failure to external factors?
Checklist Solution: Our checklist poses value-neutral questions derived from the pattern definition: “Does Nouman attribute failure to external factors?” rather than “Does Nouman show empathy?” This decouples simulation accuracy from social desirability. The 88-point gap between LLM (5/100) and human (93.3/100) Anthropomorphism scores represents...
-
[8]
Foundational Definition & Description: Literature providing authoritative definitions and elucidating the core phenomenon
-
[9]
Core Mechanisms & Theoretical Explanations: Literature exploring underlying evolutionary, cognitive, or emotional drivers
-
[10]
Output: Provide 50 references in APA format, categorized by theme
Real-World Impact & Application: Literature researching manifestations, impacts, and practical applications, including double-edged effects and applications in management, marketing, or clinical therapy. Output: Provide 50 references in APA format, categorized by theme. Table 14: Literature retrieval prompt for social-cognitive patterns. Pattern Structu...
-
[11]
Depth and Rigor: Ensure scientific, rigorous analysis. [START_CORPUS] {ALL 50 PAPERS’ CONTENT} [END_CORPUS] Table 15: Pattern structure summary prompt for personality traits. Pattern Structure Summary Prompt for Social-Cognitive Patterns System Prompt You are an expert academic synthesizer and psychological researcher. Your task is to process a large t...
-
[12]
Strict Source Adherence: Base all conclusions exclusively on the provided text corpus
-
[13]
No JSON: Output must be plain text with Markdown headings
-
[14]
Depth and Rigor: Ensure scientific, rigorous, profound analysis. [START_CORPUS] {ALL 50 PAPERS’ CONTENT} [END_CORPUS] Table 16: Pattern structure summary prompt for social-cognitive patterns. Scenario Synthesis Prompt (Part 1 of 2) System Prompt Role: You are a dual-specialist: an expert psychologist and creative screenwriter for scenario generation, a...
-
[15]
Psychological/Behavioral Patterns:{pattern_information}
-
[16]
Situational Framework:{situation}
-
[17]
Candidate Names:{candidate_names}(5 Males, 5 Females) [CRITICAL CONSTRAINT - NAMES]: You must select the Protagonist and all Supporting Characters STRICTLY from the provided “Candidate Names” list. You cannot invent new names. # Task 1: The Design Process (Analytical) Adopt your role as the “rigorous narrative analyst”. Length: UNDER 500 TOKENS
-
[18]
Design Rationale: In 2-4 sentences, explain where each input pattern will be reflected in the scenario
-
[19]
Catalyst Details: Using bullet points, identify critical details that will act as ‘catalysts’
-
[20]
expert psychologist and creative screenwriter
Expected Character Tendencies: For ALL characters, list their most likely cognitive or behavioral tendencies. * Format Requirement (STRICT): @ [Character Name]: 1. [Tendency1]; 2. [Tendency2]; 3. [Tendency3] * Each character on a separate line, starting with @. * Character name in [ ], tendencies numbered and separated by ;. # Task 2: The Scenario Executi...
-
[21]
About Self (Objective/Full Profile): * Identity & Personality (4+ distinct descriptors) * Relevant Background (1-2 sentences) * Motivation in this scenario
-
[22]
the authentic reaction of a multi-dimensional person in a specific situation
About Others (Subjective/Visible Profile): * For EACH other character, describe the relationship from current character’s perspective. Table 17: Scenario generation prompt (Part 1 of 2). Scenario Synthesis Prompt (Part 2 of 2): Output Format User Prompt (cont.) ## Core Creative Mindset for Task 2 * Compatibility: Create a context where patterns emerge ...
-
[23]
**Principles**:{pattern_information}
-
[24]
**Scenario**:{scenario}
-
[25]
**Protagonist**:{protagonist}
-
[26]
**Supporting Characters**:{supporting_characters}
-
[27]
**Design Analysis**:{analysis} **Output Requirements & Formatting:**
-
[28]
**Content:** Create a multi-turn dialogue between the **Protagonist** and **Supporting Characters**. **Strictly limit participants to provided characters; do not introduce new characters.** The dialogue should contain **between 12 and 20 individual speaking turns**
-
[29]
* **Closer**: Dialogue **must conclude** with the Protagonist
**Mandatory Flow (Start & End)**: * **Opener**: Dialogue **must begin** with a Supporting Character. * **Closer**: Dialogue **must conclude** with the Protagonist
-
[30]
One character must completely finish their turn before the next begins
**Turn Structure**: Strictly turn-based format. One character must completely finish their turn before the next begins. No interruptions or overlapping speech
-
[31]
**Trinity of Expression**: Seamlessly integrate **inner thought, external action, and spoken dialogue** throughout
-
[32]
* Actions/expressions/behaviors: Use (parentheses)
**Strict Formatting Rules**: * Inner thoughts/psychology: Use [square brackets]. * Actions/expressions/behaviors: Use (parentheses). * Spoken dialogue: Use no brackets. * Example: Hermione: [I have to devise a foolproof plan.] (She quickly draws her wand) Harry, use the flute, now!
-
[33]
Table 19: Conversation synthesis prompt (Part 1 of 2)
**No Preamble**: Do not begin with introductory text. Table 19: Conversation synthesis prompt (Part 1 of 2). Conversation Synthesis Prompt (Part 2 of 2) User Prompt (cont.) **Core Creative Principles:**
-
[34]
**Focus and Breathing Room**: This is the most crucial principle. You do **not** need to have every minor gesture or piece of small talk carry the weight of a psychological principle. Use the principles as a “**spotlight**” to illuminate and explain **the most critical turning points, the core conflicts, or the moments that best define the characters’ arc...
-
[35]
**Show, Don’t Tell**: Never allow characters to openly state or explain the psychological principles by name. Instead, you must **show** how the principles influence their judgment and choices through their concrete actions (the combination of thoughts, dialogue, and physical behavior)
-
[36]
**Psychology Drives Action**: In the key moments illuminated by the “spotlight,” the character’s [inner thought] should be the origin of their behavior, directly reflecting the influence of a psychological principle. The subsequent dialogue and (actions) should be the logical, external expression of that internal state
-
[37]
**Seamless Integration**: Weave the principles into the natural flow of the story. The entire dialogue should feel like an authentic interaction, not a contrived demonstration for a psychology case study. Table 20: Conversation synthesis prompt (Part 2 of 2). Training Role-Playing Instruction Template System Prompt You are{protagonist_name}. ==About{prota...
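The formatting and flow rules quoted from the conversation synthesis prompt above (square brackets for inner thoughts, parentheses for actions, unbracketed speech, 12 to 20 turns, a supporting-character opener, a protagonist closer) lend themselves to mechanical checking. The sketch below is one possible way to parse a turn and validate the flow. The `Turn`, `parse_turn`, and `check_flow` names are hypothetical and are not taken from the released code, and the parser assumes brackets are not nested.

```python
import re
from dataclasses import dataclass
from typing import List

_THOUGHT = re.compile(r"\[([^\[\]]*)\]")  # [inner thought]
_ACTION = re.compile(r"\(([^()]*)\)")     # (action or expression)


@dataclass
class Turn:
    speaker: str
    thoughts: List[str]
    actions: List[str]
    speech: str


def parse_turn(line: str) -> Turn:
    """Split 'Speaker: [thought] (action) speech' into its three channels."""
    speaker, sep, body = line.partition(":")
    if not sep or not speaker.strip():
        raise ValueError(f"missing 'Speaker:' prefix: {line!r}")
    thoughts = [t.strip() for t in _THOUGHT.findall(body)]
    rest = _THOUGHT.sub(" ", body)         # drop thoughts before looking for actions
    actions = [a.strip() for a in _ACTION.findall(rest)]
    speech = " ".join(_ACTION.sub(" ", rest).split())  # what remains is spoken dialogue
    return Turn(speaker.strip(), thoughts, actions, speech)


def check_flow(turns: List[Turn], protagonist: str,
               min_turns: int = 12, max_turns: int = 20) -> List[str]:
    """Return a list of violated flow rules (empty if the dialogue conforms)."""
    problems = []
    if not min_turns <= len(turns) <= max_turns:
        problems.append(f"expected {min_turns}-{max_turns} turns, got {len(turns)}")
    if turns and turns[0].speaker == protagonist:
        problems.append("dialogue must begin with a supporting character")
    if turns and turns[-1].speaker != protagonist:
        problems.append("dialogue must conclude with the protagonist")
    return problems


# The example line comes from the prompt's own formatting rules.
example = parse_turn(
    "Hermione: [I have to devise a foolproof plan.] "
    "(She quickly draws her wand) Harry, use the flute, now!"
)
print(example)
```

A check like this only verifies surface format and turn order; whether the inner thoughts actually drive the actions, as the prompt's creative principles require, still needs a checklist or human judgment.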