arxiv: 2601.10198 · v4 · submitted 2026-01-15 · 💻 cs.CL

HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns

Pith reviewed 2026-05-16 14:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM anthropomorphism · cognitive patterns · role-playing agents · benchmark · psychological modeling · multi-pattern dynamics · human alignment · persona simulation

The pith

Modeling human cognitive patterns as interacting forces lets an 8B LLM outperform a 32B model on realistic multi-pattern role-play.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The HumanLLM framework extracts 244 psychological patterns from academic papers and builds 11,359 scenarios in which two to five patterns reinforce, conflict, or modulate one another. It evaluates models with dual checklists that separately score fidelity to single patterns and to the emergent dynamics across multi-turn conversations that include inner thoughts, actions, and dialogue. The resulting HumanLLM-8B model surpasses Qwen3-32B on the harder multi-pattern tasks despite four times fewer parameters, showing that authentic anthropomorphism depends on simulating the psychological processes that generate behavior rather than merely reproducing surface actions.

Core claim

Treating psychological patterns as interacting causal forces, the authors extract 244 patterns from roughly 12,000 academic papers and synthesize 11,359 scenarios in which 2–5 patterns interact. Dual-level checklists measure both individual pattern fidelity and emergent multi-pattern dynamics, achieving r=0.90 alignment with human judgments while exposing that holistic metrics conflate simulation accuracy with social desirability. HumanLLM-8B, trained on this data, outperforms Qwen3-32B on multi-pattern dynamics despite four times fewer parameters, establishing that authentic anthropomorphism requires cognitive modeling of the processes behind human behaviors.
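
To make the scenario structure concrete, here is a minimal sketch of how one synthesized scenario might be represented. The class and field names (`Relation`, `Pattern`, `Scenario`, `interactions`) are hypothetical and are not taken from the released code; only the constraints they encode (2 to 5 patterns per scenario, and patterns that reinforce, conflict with, or modulate one another) come from the summary above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Relation(Enum):
    """The three interaction types named in the paper summary."""
    REINFORCE = "reinforce"
    CONFLICT = "conflict"
    MODULATE = "modulate"


@dataclass(frozen=True)
class Pattern:
    name: str
    kind: str  # hypothetical label, e.g. "social-cognitive" or "personality"


@dataclass
class Scenario:
    patterns: List[Pattern]
    # Maps a pair of indices into `patterns` to how those two patterns interact.
    interactions: Dict[Tuple[int, int], Relation] = field(default_factory=dict)

    def __post_init__(self) -> None:
        count = len(self.patterns)
        if not 2 <= count <= 5:
            raise ValueError("a scenario combines 2 to 5 patterns")
        for a, b in self.interactions:
            if a == b or not (0 <= a < count and 0 <= b < count):
                raise ValueError(f"invalid pattern pair: {(a, b)}")
```

Validating the pattern count at construction time keeps malformed scenarios out of any downstream scoring step; whether the released dataset enforces this the same way is an assumption.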

What carries the argument

Dual-level checklists that separately evaluate fidelity to individual psychological patterns and to their emergent interactions in multi-turn scenarios expressing inner thoughts, actions, and dialogue.
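
One way to picture the dual-level scoring is sketched below. The page does not give the paper's scoring rubric, so this assumes each checklist item is a yes/no judgment and each score is the percentage of satisfied items; the function names `checklist_score` and `dual_level_scores` are hypothetical.

```python
from typing import Dict, List


def checklist_score(answers: List[bool]) -> float:
    """Percentage (0-100) of checklist items judged satisfied."""
    if not answers:
        raise ValueError("checklist must contain at least one item")
    return 100.0 * sum(answers) / len(answers)


def dual_level_scores(pattern_answers: Dict[str, List[bool]],
                      scenario_answers: List[bool]) -> Dict[str, float]:
    """Score individual-pattern fidelity and multi-pattern dynamics separately.

    pattern_answers: per-pattern yes/no judgments (assumed format).
    scenario_answers: yes/no judgments on the emergent interaction (assumed format).
    """
    if not pattern_answers:
        raise ValueError("at least one pattern checklist is required")
    per_pattern = {name: checklist_score(a) for name, a in pattern_answers.items()}
    return {
        "pattern_level": sum(per_pattern.values()) / len(per_pattern),
        "scenario_level": checklist_score(scenario_answers),
    }
```

Reporting the two numbers separately, rather than averaging them, is what lets a response score well on each pattern in isolation while still failing on how the patterns interact.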

If this is right

  • Standard holistic metrics for role-playing agents reward socially desirable responses over accurate simulation of cognitive processes.
  • Explicit training on pattern-interaction data produces larger gains in anthropomorphism than scaling model size alone.
  • Scenarios built from conflicting or modulating patterns expose limitations hidden by single-pattern or surface-level tests.
  • Process-level cognitive modeling transfers to better handling of complex behavioral dynamics with fewer parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern-extraction method could be applied to non-academic sources to test cultural generalizability of the benchmark.
  • Applications such as therapeutic chatbots or educational agents may benefit from prioritizing process fidelity over surface realism.
  • Future benchmarks could add real-time human interaction loops to validate whether LLM performance holds when patterns unfold without pre-synthesis.

Load-bearing premise

The 244 patterns drawn from academic papers and the 11,359 synthesized scenarios faithfully represent real human cognitive interactions without meaningful selection or synthesis bias.

What would settle it

Record actual humans placed in situations that match the paper's pattern combinations and compare their inner thoughts, actions, and dialogue against the outputs of the trained LLMs.

Figures

Figures reproduced from arXiv: 2601.10198 by Hongwei Feng, Jen-tse Huang, Jian Yang, Jun Gao, Rui Xie, Shuai Huang, Weiyuan Li, Xintao Wang, Yanghua Xiao, Yuanli Gou, Yueping Kang.

Figure 1: Pattern Data Structure: 144 Social-Cognitive Patterns (left) and 100 Personality Traits (right). Each pattern comprises Definition, Core Mechanisms, and Real-World Manifestations.
Figure 2: HumanLLM Framework. Left: Dataset structure with scenarios, multi-turn conversations (inner thoughts in brackets, actions in parentheses), and dual-level checklists. Top Right: Supervised fine-tuning on target character utterances. Bottom Right: Evaluation via LLM judge scoring against pattern-level and scenario-level checklists.
Figure 3: Human-LLM evaluation alignment comparison.
Figure 4: Dataset distributions: (a) number of dialogue …
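
The conversation format noted in the Figure 2 caption (inner thoughts in brackets, actions in parentheses, spoken dialogue unmarked) can be split mechanically. The sketch below is an illustration only: `parse_turn` is a hypothetical helper, not part of the released code, and it assumes a `Speaker: ...` prefix and no nested brackets or parentheses.

```python
import re
from typing import Dict, List, Union

# Matches either a [bracketed inner thought] or a (parenthesized action).
_SEGMENT = re.compile(r"\[(?P<thought>[^\]]*)\]|\((?P<action>[^)]*)\)")


def parse_turn(line: str) -> Dict[str, Union[str, List[str]]]:
    """Split one dialogue turn into speaker, thoughts, actions, and speech."""
    speaker, _, body = line.partition(":")
    thoughts: List[str] = []
    actions: List[str] = []
    for match in _SEGMENT.finditer(body):
        if match.group("thought") is not None:
            thoughts.append(match.group("thought").strip())
        else:
            actions.append(match.group("action").strip())
    # Whatever remains after removing thoughts and actions is spoken dialogue.
    speech = " ".join(_SEGMENT.sub(" ", body).split())
    return {
        "speaker": speaker.strip(),
        "thoughts": thoughts,
        "actions": actions,
        "speech": speech,
    }
```

For a turn such as `Hermione: [I have to devise a foolproof plan.] (She quickly draws her wand) Harry, use the flute, now!`, the three channels come out as separate fields, which makes it possible to check inner thoughts, actions, and dialogue independently.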

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HumanLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.90) while revealing that holistic metrics conflate simulation accuracy with social desirability. HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4× fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling -- simulating not just what humans do, but the psychological processes generating those behaviors. Our dataset, code, and model are available at: https://github.com/YJGoodbye2024/HumanLLM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces HumanLLM, a framework for benchmarking LLM anthropomorphism by extracting 244 psychological patterns from ~12,000 academic papers, synthesizing 11,359 multi-pattern interaction scenarios (2-5 patterns per scenario), and applying dual-level checklists to measure individual pattern fidelity and emergent dynamics. It reports r=0.90 alignment with human raters and claims that a fine-tuned HumanLLM-8B model outperforms Qwen3-32B on multi-pattern tasks despite 4x fewer parameters, arguing that authentic anthropomorphism requires explicit cognitive process modeling rather than behavioral mimicry alone.

Significance. If the benchmark construction and human alignment hold under scrutiny, the work offers a structured, reproducible resource for evaluating psychological fidelity in LLMs and role-playing agents, with the open release of dataset, code, and model strengthening its potential impact on human-AI interaction research.

major comments (4)
  1. [Abstract] Abstract: The reported r=0.90 human alignment is presented without any information on the number of raters, inter-rater reliability statistics (e.g., Cohen's kappa or intraclass correlation), rater instructions, or controls for rater bias, which directly undermines the strength of the central alignment claim.
  2. [Pattern extraction and scenario synthesis] Pattern extraction and scenario synthesis sections: The process for deriving 244 patterns from ~12,000 papers and synthesizing 11,359 scenarios lacks explicit criteria for pattern selection, validation against independent human behavioral data, or controls for selection/synthesis bias (e.g., over-representation of Western/clinical constructs), which is load-bearing for the claim that the benchmark faithfully represents real cognitive interactions.
  3. [Results] Results on model comparison: The claim that HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics requires details on the fine-tuning procedure, training data composition, hyperparameter settings, and any controls for parameter count or baseline differences; without these, the size-vs-performance result cannot be interpreted as evidence for cognitive modeling.
  4. [Evaluation methodology] Evaluation methodology: The dual-level checklists are described at a high level but lack concrete examples of checklist items, scoring rubrics, or how emergent dynamics are distinguished from individual pattern fidelity, making it impossible to assess whether holistic metrics truly conflate simulation accuracy with social desirability as asserted.
minor comments (2)
  1. [Abstract] The GitHub link in the abstract should include a direct pointer to the exact dataset version and checklist templates used in the reported experiments.
  2. [Abstract] Notation for the correlation coefficient (r=0.90) should specify whether it is Pearson's r and include the associated p-value or confidence interval.
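
The reporting requested for the correlation coefficient can be sketched as follows: Pearson's r plus an approximate 95% confidence interval from the Fisher z-transformation. This is a generic statistical sketch, not the paper's analysis code; the function names are hypothetical, and the sample size the paper used is not stated on this page, so any `n` passed in is an assumption.

```python
import math
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for r (z_crit=1.96 gives about 95%)."""
    if n <= 3:
        raise ValueError("Fisher interval needs n > 3")
    if abs(r) >= 1:
        raise ValueError("Fisher transform is undefined for |r| >= 1")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

Because the interval width depends on `n - 3`, the same r=0.90 is much less informative with a few dozen rated items than with several hundred, which is why the sample size matters alongside the coefficient.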

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below. Revisions have been made to improve transparency and completeness where the concerns are valid.

  1. Referee: [Abstract] Abstract: The reported r=0.90 human alignment is presented without any information on the number of raters, inter-rater reliability statistics (e.g., Cohen's kappa or intraclass correlation), rater instructions, or controls for rater bias, which directly undermines the strength of the central alignment claim.

    Authors: We agree that the abstract should be self-contained on this central claim. The full protocol—including the number of raters, inter-rater reliability (Cohen’s kappa and ICC), rater instructions, and bias controls—is described in Section 4.2. We have revised the abstract to include a concise summary of these elements so readers can immediately assess the alignment evidence. revision: yes

  2. Referee: [Pattern extraction and scenario synthesis] Pattern extraction and scenario synthesis sections: The process for deriving 244 patterns from ~12,000 papers and synthesizing 11,359 scenarios lacks explicit criteria for pattern selection, validation against independent human behavioral data, or controls for selection/synthesis bias (e.g., over-representation of Western/clinical constructs), which is load-bearing for the claim that the benchmark faithfully represents real cognitive interactions.

    Authors: We acknowledge the need for greater explicitness. Section 3.1 outlines a systematic literature review, but we have added a new subsection with precise inclusion/exclusion criteria, cross-validation against independent behavioral datasets, and explicit discussion of source diversity and mitigation of Western/clinical bias. These additions directly support the claim of faithful representation. revision: yes

  3. Referee: [Results] Results on model comparison: The claim that HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics requires details on the fine-tuning procedure, training data composition, hyperparameter settings, and any controls for parameter count or baseline differences; without these, the size-vs-performance result cannot be interpreted as evidence for cognitive modeling.

    Authors: We agree additional detail is required for interpretability. Appendix B already contains the fine-tuning procedure, training data composition (the 11,359 scenarios), hyperparameters, and LoRA settings. We have expanded the main results section with a summary of these details plus parameter-matched baseline comparisons, allowing readers to evaluate whether the gains stem from cognitive modeling. revision: yes

  4. Referee: [Evaluation methodology] Evaluation methodology: The dual-level checklists are described at a high level but lack concrete examples of checklist items, scoring rubrics, or how emergent dynamics are distinguished from individual pattern fidelity, making it impossible to assess whether holistic metrics truly conflate simulation accuracy with social desirability as asserted.

    Authors: We have added concrete examples of individual-pattern and emergent-dynamics checklist items, the full scoring rubrics, and a new paragraph clarifying how interaction effects are isolated from summed individual scores. These revisions appear in the revised Section 4 and Appendix C, directly addressing the distinction and the potential conflation with social desirability. revision: yes
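
For readers unfamiliar with the inter-rater statistic mentioned in the first response, a minimal sketch of Cohen's kappa for two raters assigning categorical labels follows. It illustrates the standard formula only; the rater data and the function name `cohens_kappa` are hypothetical, and nothing here reflects the paper's actual rating protocol.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, so it is a stricter figure than percent agreement when one label dominates.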

Circularity Check

0 steps flagged

No significant circularity; benchmark and claims derived from external literature


The paper extracts 244 patterns from ~12,000 external academic papers and synthesizes 11,359 scenarios from them, then evaluates via human raters (r=0.90) and compares HumanLLM-8B against Qwen3-32B on the resulting benchmark. No equations, fitted parameters renamed as predictions, or self-citation chains reduce any reported result to the authors' own inputs by construction. The central claim rests on external literature and independent human validation rather than tautological self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that psychological patterns extracted from papers can be treated as distinct causal forces whose interactions can be faithfully synthesized into scenarios; no free parameters are explicitly fitted in the abstract, and no new physical entities are postulated.

axioms (1)
  • domain assumption: Psychological patterns identified in academic literature can be treated as interacting causal forces that generate observable human behavior.
    Invoked when the framework is introduced and when scenarios are synthesized from 2-5 patterns.

pith-pipeline@v0.9.0 · 5538 in / 1363 out tokens · 25352 ms · 2026-05-16T14:30:19.524868+00:00 · methodology


