HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns

Hongwei Feng; Jen-tse Huang; Jian Yang; Jun Gao; Rui Xie; Shuai Huang; Weiyuan Li; Xintao Wang; Yanghua Xiao; Yuanli Gou; Yueping Kang

arxiv: 2601.10198 · v4 · submitted 2026-01-15 · 💻 cs.CL

Pith reviewed 2026-05-16 14:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM anthropomorphism · cognitive patterns · role-playing agents · benchmark · psychological modeling · multi-pattern dynamics · human alignment · persona simulation

The pith

Modeling human cognitive patterns as interacting forces lets an 8B LLM outperform a 32B model on realistic multi-pattern role-play.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The HumanLLM framework extracts 244 psychological patterns from academic papers and builds 11,359 scenarios in which two to five patterns reinforce, conflict, or modulate one another. It evaluates models with dual checklists that separately score fidelity to single patterns and to the emergent dynamics across multi-turn conversations that include inner thoughts, actions, and dialogue. The resulting HumanLLM-8B model surpasses Qwen3-32B on the harder multi-pattern tasks despite four times fewer parameters, showing that authentic anthropomorphism depends on simulating the psychological processes that generate behavior rather than merely reproducing surface actions.

Core claim

Treating psychological patterns as interacting causal forces, the authors extract 244 patterns from roughly 12,000 academic papers and synthesize 11,359 scenarios in which 2–5 patterns interact. Dual-level checklists measure both individual pattern fidelity and emergent multi-pattern dynamics, achieving r=0.90 alignment with human judgments while exposing that holistic metrics conflate simulation accuracy with social desirability. HumanLLM-8B, trained on this data, outperforms Qwen3-32B on multi-pattern dynamics despite four times fewer parameters, establishing that authentic anthropomorphism requires cognitive modeling of the processes behind human behaviors.

What carries the argument

Dual-level checklists that separately evaluate fidelity to individual psychological patterns and to their emergent interactions in multi-turn scenarios expressing inner thoughts, actions, and dialogue.
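
The dual-level idea can be made concrete with a small scoring sketch. It assumes that each checklist item is a yes/no question answered by a judge and that each level's score is the percentage of satisfied items. The class name, field names, example questions, and aggregation rule are all hypothetical illustrations, not the paper's released schema or rubric.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str  # value-neutral yes/no question put to the judge
    level: str     # "pattern" (single-pattern fidelity) or "scenario" (multi-pattern dynamics)
    passed: bool   # judge's verdict for one model response


def level_scores(items: list[ChecklistItem]) -> dict[str, float]:
    """Percentage of satisfied items per level (an assumed aggregation rule)."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for item in items:
        totals[item.level] = totals.get(item.level, 0) + 1
        hits[item.level] = hits.get(item.level, 0) + int(item.passed)
    return {level: 100.0 * hits[level] / totals[level] for level in totals}


# Hypothetical verdicts for one response; the questions are invented for illustration.
items = [
    ChecklistItem("Does the character attribute failure to external factors?", "pattern", True),
    ChecklistItem("Does the character describe the other team in dispositional terms?", "pattern", True),
    ChecklistItem("Does the first pattern visibly intensify the second across turns?", "scenario", False),
    ChecklistItem("Do the inner thoughts show the two patterns pulling in different directions?", "scenario", True),
]
scores = level_scores(items)
print(scores)  # {'pattern': 100.0, 'scenario': 50.0}
```

Keeping the two levels as separate scores, rather than averaging them, is what lets a response that reproduces each pattern in isolation still be marked down for missing their interaction.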

If this is right

Standard holistic metrics for role-playing agents reward socially desirable responses over accurate simulation of cognitive processes.
Explicit training on pattern-interaction data produces larger gains in anthropomorphism than scaling model size alone.
Scenarios built from conflicting or modulating patterns expose limitations hidden by single-pattern or surface-level tests.
Process-level cognitive modeling transfers to better handling of complex behavioral dynamics with fewer parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern-extraction method could be applied to non-academic sources to test cultural generalizability of the benchmark.
Applications such as therapeutic chatbots or educational agents would benefit from prioritizing process fidelity over surface realism.
Future benchmarks could add real-time human interaction loops to validate whether LLM performance holds when patterns unfold without pre-synthesis.

Load-bearing premise

The 244 patterns drawn from academic papers and the 11,359 synthesized scenarios faithfully represent real human cognitive interactions without meaningful selection or synthesis bias.

What would settle it

Record actual humans placed in situations that match the paper's pattern combinations and compare their inner thoughts, actions, and dialogue against the outputs of the trained LLMs.

Figures

Figures reproduced from arXiv: 2601.10198 by Hongwei Feng, Jen-tse Huang, Jian Yang, Jun Gao, Rui Xie, Shuai Huang, Weiyuan Li, Xintao Wang, Yanghua Xiao, Yuanli Gou, Yueping Kang.

**Figure 1.** Pattern Data Structure: 144 Social-Cognitive Patterns (left) and 100 Personality Traits (right). Each pattern comprises Definition, Core Mechanisms, and Real-World Manifestations.

**Figure 2.** HumanLLM Framework. Left: Dataset structure with scenarios, multi-turn conversations (inner thoughts in brackets, actions in parentheses), and dual-level checklists. Top Right: Supervised fine-tuning on target character utterances. Bottom Right: Evaluation via LLM judge scoring against pattern-level and scenario-level checklists.

**Figure 3.** Human-LLM evaluation alignment comparison.

**Figure 4.** Dataset distributions: (a) number of dialogue…
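
The Figure 2 caption describes a turn format with inner thoughts in square brackets and actions in parentheses. A minimal parsing sketch for that convention follows. The `Turn` class, the `parse_turn` function, and the sample line are illustrative assumptions and are not taken from the released code; nested brackets are not handled.

```python
import re
from dataclasses import dataclass, field

# One pass over the utterance: [thought], (action), or a run of plain speech.
TOKEN = re.compile(
    r"\[(?P<thought>[^\]]*)\]|\((?P<action>[^)]*)\)|(?P<speech>[^\[\(]+)"
)


@dataclass
class Turn:
    speaker: str
    thoughts: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    speech: list[str] = field(default_factory=list)


def parse_turn(line: str) -> Turn:
    """Split 'Speaker: [thought] (action) speech' into its three channels."""
    speaker, _, body = line.partition(":")
    turn = Turn(speaker=speaker.strip())
    for match in TOKEN.finditer(body):
        if match.group("thought") is not None:
            turn.thoughts.append(match.group("thought").strip())
        elif match.group("action") is not None:
            turn.actions.append(match.group("action").strip())
        else:
            text = match.group("speech").strip()
            if text:  # skip whitespace between bracketed segments
                turn.speech.append(text)
    return turn


turn = parse_turn(
    "Hermione: [I have to devise a foolproof plan.] "
    "(She quickly draws her wand) Harry, use the flute, now!"
)
print(turn.thoughts, turn.actions, turn.speech)
```

Separating the three channels this way would let a judge or a checklist inspect inner thoughts independently of spoken dialogue.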

Original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HumanLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from $\sim$12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment ($r=0.90$) while revealing that holistic metrics conflate simulation accuracy with social desirability. HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4$\times$ fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling -- simulating not just what humans do, but the psychological processes generating those behaviors. Our dataset, code, and model are available at: https://github.com/YJGoodbye2024/HumanLLM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives us a new dataset of 244 literature-derived cognitive patterns and 11k interacting scenarios for testing LLM anthropomorphism, with some evidence that smaller models can beat larger ones on multi-pattern cases, but the selection and synthesis steps look vulnerable to bias.

The main thing to know is that they extracted 244 psychological patterns from around 12,000 papers, built 11,359 scenarios where two to five patterns interact through reinforcement, conflict, or modulation, and created dual checklists to score both single-pattern accuracy and the emergent dynamics across multi-turn dialogues. They report r=0.90 alignment with human raters and show their HumanLLM-8B model outperforming Qwen3-32B on the harder multi-pattern items despite fewer parameters. The release of data, code, and model on GitHub is a clear plus for anyone who wants to inspect or extend the work. The observation that standard holistic metrics can conflate simulation quality with social desirability is also worth noting, as it points to a real measurement issue in this area.

What the paper does well is lay out a concrete way to move beyond surface behavior matching toward something closer to process-level simulation, and the causal-force framing of the patterns gives the evaluation a bit more structure than most prior anthropomorphism checks.

The soft spots sit mainly in the construction pipeline. Deriving patterns from academic papers risks overweighting well-studied, often Western or clinical constructs while under-sampling ordinary, culturally variable, or low-salience cognition. The scenario synthesis step then imposes specific interaction structures that may not occur naturally, which could make the benchmark easier for models explicitly tuned to those patterns. The abstract gives no numbers on inter-rater reliability, extraction validation, or controls for synthesis artifacts, so the r=0.90 figure is hard to interpret without the full methods. If those gaps are not closed, the 8B-versus-32B result could partly reflect benchmark artifacts rather than genuine cognitive modeling gains.

This is for people building role-playing agents or persona systems who need something more than generic helpfulness benchmarks. A reader working on alignment or evaluation datasets would get practical value from the released materials even if the headline claims require tighter validation. I would send it to peer review. The dataset itself is new enough and the dual-level idea is coherent enough that referees should see it, provided the authors supply the missing details on pattern sourcing and rater protocols.

Referee Report

4 major / 2 minor

Summary. The paper introduces HumanLLM, a framework for benchmarking LLM anthropomorphism by extracting 244 psychological patterns from ~12,000 academic papers, synthesizing 11,359 multi-pattern interaction scenarios (2-5 patterns per scenario), and applying dual-level checklists to measure individual pattern fidelity and emergent dynamics. It reports r=0.90 alignment with human raters and claims that a fine-tuned HumanLLM-8B model outperforms Qwen3-32B on multi-pattern tasks despite 4x fewer parameters, arguing that authentic anthropomorphism requires explicit cognitive process modeling rather than behavioral mimicry alone.

Significance. If the benchmark construction and human alignment hold under scrutiny, the work offers a structured, reproducible resource for evaluating psychological fidelity in LLMs and role-playing agents, with the open release of dataset, code, and model strengthening its potential impact on human-AI interaction research.

major comments (4)

[Abstract] Abstract: The reported r=0.90 human alignment is presented without any information on the number of raters, inter-rater reliability statistics (e.g., Cohen's kappa or intraclass correlation), rater instructions, or controls for rater bias, which directly undermines the strength of the central alignment claim.
[Pattern extraction and scenario synthesis] Pattern extraction and scenario synthesis sections: The process for deriving 244 patterns from ~12,000 papers and synthesizing 11,359 scenarios lacks explicit criteria for pattern selection, validation against independent human behavioral data, or controls for selection/synthesis bias (e.g., over-representation of Western/clinical constructs), which is load-bearing for the claim that the benchmark faithfully represents real cognitive interactions.
[Results] Results on model comparison: The claim that HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics requires details on the fine-tuning procedure, training data composition, hyperparameter settings, and any controls for parameter count or baseline differences; without these, the size-vs-performance result cannot be interpreted as evidence for cognitive modeling.
[Evaluation methodology] Evaluation methodology: The dual-level checklists are described at a high level but lack concrete examples of checklist items, scoring rubrics, or how emergent dynamics are distinguished from individual pattern fidelity, making it impossible to assess whether holistic metrics truly conflate simulation accuracy with social desirability as asserted.
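
The first major comment asks for inter-rater reliability statistics such as Cohen's kappa. As a worked illustration of what that statistic measures, the sketch below computes kappa for two raters answering the same yes/no checklist items. The function name and the label lists are made up for the example and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts from two raters on eight checklist items.
rater_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater_b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.75
```

Here the raters agree on 7 of 8 items (0.875), chance agreement is 0.5, and kappa is (0.875 - 0.5) / (1 - 0.5) = 0.75, which shows why raw agreement alone overstates reliability.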

minor comments (2)

[Abstract] The GitHub link in the abstract should include a direct pointer to the exact dataset version and checklist templates used in the reported experiments.
[Abstract] Notation for the correlation coefficient (r=0.90) should specify whether it is Pearson's r and include the associated p-value or confidence interval.
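
The second minor comment asks for the type of correlation and a confidence interval. A minimal sketch of one common way to report both follows, assuming Pearson's r and an approximate 95% interval from the Fisher z-transform. The score lists are invented for illustration, and nothing here is claimed to be the procedure the authors used.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Sample Pearson correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for r via the Fisher z-transform."""
    if n < 4 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 4 and -1 < r < 1")
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Made-up judge and human scores for six responses (not data from the paper).
judge = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
human = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
r = pearson_r(judge, human)
low, high = fisher_ci(r, len(judge))
print(f"r = {r:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only six points the interval is very wide, which is the practical reason a bare r value is hard to interpret without the sample size.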

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below. Revisions have been made to improve transparency and completeness where the concerns are valid.

Referee: [Abstract] Abstract: The reported r=0.90 human alignment is presented without any information on the number of raters, inter-rater reliability statistics (e.g., Cohen's kappa or intraclass correlation), rater instructions, or controls for rater bias, which directly undermines the strength of the central alignment claim.

Authors: We agree that the abstract should be self-contained on this central claim. The full protocol (including the number of raters, inter-rater reliability (Cohen’s kappa and ICC), rater instructions, and bias controls) is described in Section 4.2. We have revised the abstract to include a concise summary of these elements so readers can immediately assess the alignment evidence.
revision: yes

Referee: [Pattern extraction and scenario synthesis] Pattern extraction and scenario synthesis sections: The process for deriving 244 patterns from ~12,000 papers and synthesizing 11,359 scenarios lacks explicit criteria for pattern selection, validation against independent human behavioral data, or controls for selection/synthesis bias (e.g., over-representation of Western/clinical constructs), which is load-bearing for the claim that the benchmark faithfully represents real cognitive interactions.

Authors: We acknowledge the need for greater explicitness. Section 3.1 outlines a systematic literature review, but we have added a new subsection with precise inclusion/exclusion criteria, cross-validation against independent behavioral datasets, and explicit discussion of source diversity and mitigation of Western/clinical bias. These additions directly support the claim of faithful representation.
revision: yes

Referee: [Results] Results on model comparison: The claim that HumanLLM-8B outperforms Qwen3-32B on multi-pattern dynamics requires details on the fine-tuning procedure, training data composition, hyperparameter settings, and any controls for parameter count or baseline differences; without these, the size-vs-performance result cannot be interpreted as evidence for cognitive modeling.

Authors: We agree additional detail is required for interpretability. Appendix B already contains the fine-tuning procedure, training data composition (the 11,359 scenarios), hyperparameters, and LoRA settings. We have expanded the main results section with a summary of these details plus parameter-matched baseline comparisons, allowing readers to evaluate whether the gains stem from cognitive modeling.
revision: yes

Referee: [Evaluation methodology] Evaluation methodology: The dual-level checklists are described at a high level but lack concrete examples of checklist items, scoring rubrics, or how emergent dynamics are distinguished from individual pattern fidelity, making it impossible to assess whether holistic metrics truly conflate simulation accuracy with social desirability as asserted.

Authors: We have added concrete examples of individual-pattern and emergent-dynamics checklist items, the full scoring rubrics, and a new paragraph clarifying how interaction effects are isolated from summed individual scores. These revisions appear in the revised Section 4 and Appendix C, directly addressing the distinction and the potential conflation with social desirability.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark and claims derived from external literature

The paper extracts 244 patterns from ~12,000 external academic papers and synthesizes 11,359 scenarios from them, then evaluates via human raters (r=0.90) and compares HumanLLM-8B against Qwen3-32B on the resulting benchmark. No equations, fitted parameters renamed as predictions, or self-citation chains reduce any reported result to the authors' own inputs by construction. The central claim rests on external literature and independent human validation rather than tautological self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that psychological patterns extracted from papers can be treated as independent causal forces whose interactions can be faithfully synthesized into scenarios; no free parameters are explicitly fitted in the abstract, and no new physical entities are postulated.

axioms (1)

domain assumption: Psychological patterns identified in academic literature can be treated as interacting causal forces that generate observable human behavior. Invoked when the framework is introduced and when scenarios are synthesized from 2-5 patterns.

pith-pipeline@v0.9.0 · 5538 in / 1363 out tokens · 25352 ms · 2026-05-16T14:30:19.524868+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We construct 244 patterns from ∼12,000 academic papers and synthesize 11,359 scenarios where 2–5 patterns reinforce, conflict, or modulate each other"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Large language models show amplified cognitive biases in moral decision-making.Pro- ceedings of the National Academy of Sciences, 122(25):e2412015122. Robert B Cialdini and 1 others. 2009.Influence: Sci- ence and practice, volume 4. Pearson education Boston. Lee J Cronbach and Paul E Meehl. 1955. Construct va- lidity in psychological tests.Psychological b...

work page internal anchor Pith review Pith/arXiv arXiv 2009

[2] [2]

Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu

Role play with large language models.Nature, 623(7987):493–498. Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu

work page

[3] [3]

Character-llm: A trainable agent for role-playing.arXiv preprint arXiv:2310.10158, 2023

Character-LLM: A Trainable Agent for Role- Playing.Preprint, arXiv:2310.10158. Amos Tversky and Daniel Kahneman. 1974. Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty.science, 185(4157):1124–1131. Tomer Ullman. 2023. Large language models fail on trivial alterations to theory-of-m...

work page arXiv 1974

[4] [4]

The Rise of AI Companions: Interaction with AI Companions and Psychological Well-being

Evaluating character understanding of large language models via character profiling from fictional works. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8015–8036, Miami, Florida, USA. Association for Computational Linguistics. Yutong Zhang, Dora Zhao, Jeffrey T. Hancock, Robert Kraut, and Diyi Yang. 2025. ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

innovative interface,

Pattern-Accurate Behavior: Nouman’s di- alogue precisely instantiates theultimate at- tribution error—framing engineering’s short- comings as unavoidable situational constraints (“innovative interface,” “timeline pressure”) while characterizing marketing’s concerns in dispositional terms (“subjective,” “panic,” “catastrophizing”)

work page

[6] [6]

contra- dicts his core identity as a data-driven engi- neer

LLM Misinterpretation: The holistic judge accurately detectsthe defensive, dismissive behavior butmisinterpretsit as a quality de- fect. It penalizes the model for generating an unlikable character despite the instruction to exhibit attribution bias. The criticism “contra- dicts his core identity as a data-driven engi- neer” reveals the judge’s failure to...

work page

[7] [7]

Does Nouman attribute fail- ure to external factors?

Checklist Solution: Our checklist poses value-neutral questions derived from the pat- tern definition: “Does Nouman attribute fail- ure to external factors?” rather than “Does Nouman show empathy?” This decouples sim- ulation accuracy from social desirability. The 88-point gap between LLM (5/100) and hu- man (93.3/100) Anthropomorphism scores repre- sents...

[8]

Foundational Definition & Description: Literature providing authoritative definitions and elucidating the core phenomenon

[9]

Core Mechanisms & Theoretical Explanations: Literature exploring underlying evolutionary, cognitive, or emotional drivers

[10]

Real-World Impact & Application: Literature researching manifestations, impacts, and practical applications, including double-edged effects and applications in management, marketing, or clinical therapy.

Output: Provide 50 references in APA format, categorized by theme.

Table 14: Literature retrieval prompt for social-cognitive patterns.

Pattern Structu...

[11]

Depth and Rigor: Ensure scientific, rigorous analysis.

[START_CORPUS] {ALL 50 PAPERS’ CONTENT} [END_CORPUS]

Table 15: Pattern structure summary prompt for personality traits.

Pattern Structure Summary Prompt for Social-Cognitive Patterns

System Prompt

You are an expert academic synthesizer and psychological researcher. Your task is to process a large t...

[12]

Strict Source Adherence: Base all conclusions exclusively on the provided text corpus

[13]

No JSON: Output must be plain text with Markdown headings

[14]

Depth and Rigor: Ensure scientific, rigorous, profound analysis.

[START_CORPUS] {ALL 50 PAPERS’ CONTENT} [END_CORPUS]

Table 16: Pattern structure summary prompt for social-cognitive patterns.

Scenario Synthesis Prompt (Part 1 of 2)

System Prompt

Role: You are a dual-specialist: an expert psychologist and creative screenwriter for scenario generation, a...

[15]

Psychological/Behavioral Patterns: {pattern_information}

[16]

Situational Framework: {situation}

[17]

Candidate Names: {candidate_names} (5 Males, 5 Females)

[CRITICAL CONSTRAINT - NAMES]: You must select the Protagonist and all Supporting Characters STRICTLY from the provided “Candidate Names” list. You cannot invent new names.

# Task 1: The Design Process (Analytical)

Adopt your role as the “rigorous narrative analyst”. Length: UNDER 500 TOKENS

[18]

Design Rationale: In 2-4 sentences, explain where each input pattern will be reflected in the scenario

[19]

Catalyst Details: Using bullet points, identify critical details that will act as ‘catalysts’

[20]

expert psychologist and creative screenwriter

Expected Character Tendencies: For ALL characters, list their most likely cognitive or behavioral tendencies.
* Format Requirement (STRICT): @ [Character Name]: 1. [Tendency1]; 2. [Tendency2]; 3. [Tendency3]
* Each character on a separate line, starting with @.
* Character name in [ ], tendencies numbered and separated by ;.

# Task 2: The Scenario Executi...
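The strict tendency format in the "Expected Character Tendencies" requirement (one `@ [Character Name]: 1. ...; 2. ...; 3. ...` line per character) lends itself to mechanical parsing. The following is a minimal sketch, not the authors' code: the function name `parse_tendencies` and the sample line are hypothetical, and it assumes that square brackets around individual tendencies are optional placeholders that can be stripped.

```python
import re
from typing import Dict, List

# "@ [Name]: rest" -- the character name sits in square brackets after "@".
_LINE = re.compile(r"^@\s*\[(?P<name>[^\]]+)\]\s*:\s*(?P<rest>.+)$")
# "1. some tendency" -- a numbered item inside the ";"-separated list.
_ITEM = re.compile(r"^\s*\d+\.\s*(.+?)\s*$")


def parse_tendencies(text: str) -> Dict[str, List[str]]:
    """Map each character name to its list of tendencies.

    Lines that do not start with "@" are ignored; malformed "@" lines
    raise ValueError so format violations are not silently dropped.
    """
    result: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("@"):
            continue
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"malformed character line: {line!r}")
        items: List[str] = []
        for part in match.group("rest").split(";"):
            if not part.strip():
                continue  # tolerate a trailing ";"
            item = _ITEM.match(part)
            if item is None:
                raise ValueError(f"tendency is not numbered: {part!r}")
            # Assumption: brackets around a tendency are placeholders.
            items.append(item.group(1).strip("[] "))
        result[match.group("name").strip()] = items
    return result
```

A checker built this way could reject generated design analyses that break the format before they reach the scenario stage; whether the original pipeline does so is not stated in the visible text.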

[21]

About Self (Objective/Full Profile):
* Identity & Personality (4+ distinct descriptors)
* Relevant Background (1-2 sentences)
* Motivation in this scenario

[22]

the authentic reaction of a multi-dimensional person in a specific situation

About Others (Subjective/Visible Profile):
* For EACH other character, describe the relationship from current character’s perspective.

Table 17: Scenario generation prompt (Part 1 of 2).

Scenario Synthesis Prompt (Part 2 of 2): Output Format

User Prompt (cont.)

## Core Creative Mindset for Task 2
* Compatibility: Create a context where patterns emerge ...

[23]

**Principles**: {pattern_information}

[24]

**Scenario**: {scenario}

[25]

**Protagonist**: {protagonist}

[26]

**Supporting Characters**: {supporting_characters}

[27]

**Design Analysis**: {analysis}

**Output Requirements & Formatting:**

[28]

**Content:** Create a multi-turn dialogue between the **Protagonist** and **Supporting Characters**. **Strictly limit participants to provided characters; do not introduce new characters.** The dialogue should contain **between 12 and 20 individual speaking turns**

[29]

**Mandatory Flow (Start & End)**:
* **Opener**: Dialogue **must begin** with a Supporting Character.
* **Closer**: Dialogue **must conclude** with the Protagonist

[30]

**Turn Structure**: Strictly turn-based format. One character must completely finish their turn before the next begins. No interruptions or overlapping speech
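The structural rules stated in the conversation prompt (12 to 20 speaking turns, a Supporting Character opens, the Protagonist closes, no characters outside the provided list) can be checked automatically on a generated dialogue. The sketch below is illustrative only: `validate_dialogue_flow` and its parameters are hypothetical names, and it assumes the dialogue has already been reduced to an ordered list of speaker names, one per turn.

```python
from typing import List, Sequence


def validate_dialogue_flow(
    speakers: Sequence[str],
    protagonist: str,
    supporting: Sequence[str],
    min_turns: int = 12,
    max_turns: int = 20,
) -> List[str]:
    """Return a list of rule violations; an empty list means the flow is valid."""
    problems: List[str] = []
    supporting_set = set(supporting)
    allowed = supporting_set | {protagonist}

    # Rule: between 12 and 20 individual speaking turns.
    if not (min_turns <= len(speakers) <= max_turns):
        problems.append(
            f"turn count {len(speakers)} outside {min_turns}-{max_turns}"
        )
    # Rule: the opener must be a Supporting Character.
    if speakers and speakers[0] not in supporting_set:
        problems.append("dialogue must begin with a supporting character")
    # Rule: the closer must be the Protagonist.
    if speakers and speakers[-1] != protagonist:
        problems.append("dialogue must conclude with the protagonist")
    # Rule: no participants beyond the provided characters.
    unknown = sorted(set(speakers) - allowed)
    if unknown:
        problems.append("unlisted characters: " + ", ".join(unknown))
    return problems
```

Returning all violations at once, rather than failing on the first, makes it easier to feed a complete error report back into a regeneration step.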

[31]

**Trinity of Expression**: Seamlessly integrate **inner thought, external action, and spoken dialogue** throughout

[32]

**Strict Formatting Rules**:
* Inner thoughts/psychology: Use [square brackets].
* Actions/expressions/behaviors: Use (parentheses).
* Spoken dialogue: Use no brackets.
* Example: Hermione: [I have to devise a foolproof plan.] (She quickly draws her wand) Harry, use the flute, now!
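Because the formatting rules assign a distinct delimiter to each channel (square brackets for inner thoughts, parentheses for actions, no brackets for speech), a single turn can be split back into its three components. This is a minimal sketch under stated assumptions: the `Turn` dataclass and `parse_turn` function are hypothetical, and nested or unbalanced brackets are not handled.

```python
import re
from dataclasses import dataclass, field
from typing import List

_THOUGHT = re.compile(r"\[([^\[\]]*)\]")  # [inner thought]
_ACTION = re.compile(r"\(([^()]*)\)")     # (external action)


@dataclass
class Turn:
    speaker: str
    thoughts: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    speech: str = ""


def parse_turn(line: str) -> Turn:
    """Split 'Speaker: [thought] (action) speech' into its three channels."""
    speaker, sep, body = line.partition(":")
    if not sep:
        raise ValueError("expected 'Speaker: content'")
    thoughts = [t.strip() for t in _THOUGHT.findall(body)]
    # Remove thoughts first so parentheses inside them are not read as actions.
    without_thoughts = _THOUGHT.sub(" ", body)
    actions = [a.strip() for a in _ACTION.findall(without_thoughts)]
    # Whatever remains outside any bracket is spoken dialogue.
    speech = " ".join(_ACTION.sub(" ", without_thoughts).split())
    return Turn(speaker.strip(), thoughts, actions, speech)
```

Applied to the example from the prompt, `parse_turn` yields the speaker `Hermione`, one thought, one action, and the spoken text `Harry, use the flute, now!`.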

[33]

**No Preamble**: Do not begin with introductory text.

Table 19: Conversation synthesis prompt (Part 1 of 2).

Conversation Synthesis Prompt (Part 2 of 2)

User Prompt (cont.)

**Core Creative Principles:**

[34]

**Focus and Breathing Room**: This is the most crucial principle. You do **not** need to have every minor gesture or piece of small talk carry the weight of a psychological principle. Use the principles as a “**spotlight**” to illuminate and explain **the most critical turning points, the core conflicts, or the moments that best define the characters’ arc...

[35]

**Show, Don’t Tell**: Never allow characters to openly state or explain the psychological principles by name. Instead, you must **show** how the principles influence their judgment and choices through their concrete actions (the combination of thoughts, dialogue, and physical behavior)

[36]

**Psychology Drives Action**: In the key moments illuminated by the “spotlight,” the character’s [inner thought] should be the origin of their behavior, directly reflecting the influence of a psychological principle. The subsequent dialogue and (actions) should be the logical, external expression of that internal state

[37]

results": [ {

**Seamless Integration**: Weave the principles into the natural flow of the story. The entire dialogue should feel like an authentic interaction, not a contrived demonstration for a psychology case study.

Table 20: Conversation synthesis prompt (Part 2 of 2).

Training Role-Playing Instruction Template

System Prompt

You are {protagonist_name}. ==About{prota...
