ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

Chaewon Heo; Jongwon Lim; Nalim Kim; Sangjun Song; Woojung Song; Yohan Jo

arxiv: 2606.05553 · v1 · pith:NTWFKGUNnew · submitted 2026-06-04 · 💻 cs.CL · cs.AI

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

Woojung Song , Nalim Kim , Sangjun Song , Chaewon Heo , Jongwon Lim , Yohan Jo This is my paper

Pith reviewed 2026-06-28 02:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords role-playing language agentscharacter arcsnarrative evaluationpsychological trajectorycontext strategiesLLM benchmarksfine-tuning

0 comments

The pith

Role-playing agents align better with evolving characters when conditioned on a psychological arc.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether role-playing language agents can adjust their behavior to match how a character changes over the course of a story. It creates a benchmark called ArcANE that uses character arcs to probe responses at different story phases for both familiar and new scenarios. Results show that giving the model the character arc as context works better than other ways of providing story information, and this edge is biggest when the scenario is new and not in the book. The authors also fine-tune models on this data to boost performance further in those new situations.

Core claim

Conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. The benchmark spans 17 novels and 80 principal characters, with each probe repeating the same scenario across phases.

What carries the argument

The Character Arc, which segments the narrative into phases along a psychological axis to track the character's psychological trajectory.

If this is right

All six models perform best with Character Arc context.
The performance advantage grows in scenarios beyond the source text.
Fine-tuned models ArcANE-8B and ArcANE-32B show widened gaps favoring the arc approach in out-of-text cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Role-playing systems may need built-in mechanisms to track and update character states over time rather than relying on static prompts.
This evaluation method could apply to assessing consistency in other sequential tasks like story continuation or multi-turn dialogues.
Automatic arc construction might be improved by incorporating more explicit psychological models.

Load-bearing premise

The automatically constructed Character Arc accurately segments the narrative into phases along a psychological axis that reflects the character's true psychological trajectory.

What would settle it

A study in which experts manually create character arcs for the same novels and compare model performance using those versus the automatic ones, or check if performance equalizes when arcs are shuffled.

Figures

Figures reproduced from arXiv: 2606.05553 by Chaewon Heo, Jongwon Lim, Nalim Kim, Sangjun Song, Woojung Song, Yohan Jo.

**Figure 1.** Figure 1: An example from ARCANE: an Out-ofWorld probe elicits different moral responses from Harry across his Character Arc, shifting from Punitive Justice in Book 1 to Empathic Forgiveness in Book 5. responses that sustain the user’s sense of immersion (Wu et al., 2025). Recent work has framed point-in-time roleplaying as a task that situates a character at a specific moment in their timeline, motivated by the… view at source ↗

**Figure 2.** Figure 2: The ARCANE pipeline: constructing character arcs (top) and evaluation probes (bottom). Stage 1: Candidate Generation. The novel is processed through two independent chapter-level streams: an event stream extracts psychologically impactful events, and a state stream emits a crosssectional psychological profile. The split keeps event omission separable from state misreading even when both streams see the sa… view at source ↗

**Figure 3.** Figure 3: Per-category Arc lift (Arc minus best non-Arc) [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Arc source-of-effect ablation across three [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Human Annotation Page for axis rating. Human-annotator ratings. Three annotators independently rate each axis on three substantive dimensions ( [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Induced axes are clustered and grounded against established literary or psychological scholarship, following the construct-grounding criterion of Values in the Wild (Huang et al., 2025). that all N phases of one arc receive distinct eras when N does not exceed the menu size. Per-(scene phase, probe type) structure. For an arc of N phases, the designer is called 3N times: once per (scene phase, probe type) … view at source ↗

**Figure 7.** Figure 7: One Out-of-World probe from Anna Karenina’s Agency–Communion arc, truncated to the first two of N=5 phase responses. The same (scenario, question) is asked at every phase, with the references shifting along the axis. reformulations that let the axis transpose into a nonsource era (Appendix B). The phase_responses list contains N objects, one per phase, each with phase_idx, phase_label, query_chapter, gt_… view at source ↗

**Figure 8.** Figure 8: Arc-over-Vanilla Overall lift, central ( [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Annotation Page for LLm judges [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

read the original abstract

Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether responses align with the character's psychological trajectory, especially in scenarios the source text never explores. We introduce ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments the narrative into phases along a psychological axis, and each probe poses the same scenario across phases, spanning both situations within the source text and situations beyond it. Across six models and six context modes, conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. We further fine-tune open-weight models on the same data to obtain ArcANE-8B/32B, which widen the Arc advantage even more on scenarios outside the source text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ArcANE gives a practical new benchmark for testing whether role-playing agents track character change over time, but the automatic arc construction lacks validation and the results rest on unshown details.

read the letter

The paper's main contribution is ArcANE, a benchmark built from 17 novels and 80 characters that tests agents on the same scenario at different story phases, including out-of-text cases. It shows conditioning on the character arc beats other context strategies across six models, with bigger gains when retrieval has no source text to pull from. They also release fine-tuned ArcANE-8B and 32B models that widen the advantage on those harder probes.

This is useful because most existing role-play evals treat characters as static. The setup directly targets the gap between factual recall and psychological trajectory, and the out-of-text results are the clearest signal that arc awareness matters.

The soft spot is the automatic construction of the arcs themselves. The paper segments narratives along psychological axes without reporting human validation or agreement with literary consensus on phase boundaries. If the LLM-based pipeline introduces its own artifacts, the performance edge could reflect models learning the construction method rather than genuine arc sensitivity. The abstract mentions results across models and scenarios but gives no error bars, statistical tests, or construction details, so the data support is hard to judge from what's visible.

This is for researchers building or evaluating conversational agents that need to handle evolving personas. A reader working on agent benchmarks or role-play applications would get value from the new test cases and the fine-tuned models. It deserves peer review because the core idea is straightforward and the out-of-text probe is a real advance, even if the arc quality needs more checking.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ArcANE, an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments each narrative into phases along a psychological axis; probes then present the same scenario at different phases, covering both in-text and out-of-text situations. Across six models and six context modes, conditioning on the Character Arc outperforms all other strategies on every model, with the largest gains on out-of-text probes where retrieval has no source material. Fine-tuned ArcANE-8B/32B models further widen this advantage.

Significance. If the automatically derived arcs are faithful to genuine psychological trajectories, the work supplies a scalable evaluation framework that exposes limitations of retrieval-only context for role-playing agents and demonstrates measurable gains from explicit arc conditioning, especially for extrapolation. The scale (17 novels), consistent cross-model pattern, and fine-tuning results constitute concrete empirical contributions that could serve as a baseline for future RPLA research.

major comments (3)

[§3 (Benchmark Construction)] §3 (Benchmark Construction): The headline result—that Character Arc conditioning outperforms all baselines, especially out-of-text—rests on the arcs being faithful segmentations of psychological change. The automatic pipeline (LLM summarization and clustering) is described but no human validation, inter-rater agreement, or comparison against literary consensus on phase boundaries or axis labels is reported; without this, superior performance could reflect models echoing the construction method rather than demonstrating arc awareness.
[Results (§4)] Results (abstract and §4): The claim that the Arc advantage is largest on out-of-text scenarios is load-bearing for the argument that retrieval fails where arc conditioning succeeds, yet no statistical tests, confidence intervals, number of probes per condition, or data-exclusion criteria are supplied, preventing assessment of whether the reported gaps are reliable or driven by a small number of items.
[Evaluation protocol] Evaluation protocol: The out-of-text probes are central to showing that the Arc benefit is not merely retrieval, but the manuscript provides no explicit criteria or examples confirming these scenarios are both (a) genuinely absent from the source text and (b) still psychologically coherent with the arc phase; this ambiguity weakens the interpretation of the largest gap.

minor comments (2)

[Abstract] Abstract: The six context modes are referenced but never enumerated; a brief parenthetical list would improve readability.
[§3] Notation: The psychological axes are introduced without a single concrete example (e.g., one novel’s axis labels and phase descriptions), making it harder to judge face validity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on the ArcANE benchmark. We address each of the major comments below, proposing revisions to strengthen the manuscript where the concerns are valid.

read point-by-point responses

Referee: [§3 (Benchmark Construction)] The headline result—that Character Arc conditioning outperforms all baselines, especially out-of-text—rests on the arcs being faithful segmentations of psychological change. The automatic pipeline (LLM summarization and clustering) is described but no human validation, inter-rater agreement, or comparison against literary consensus on phase boundaries or axis labels is reported; without this, superior performance could reflect models echoing the construction method rather than demonstrating arc awareness.

Authors: We agree that additional validation would bolster confidence in the arcs. While the pipeline relies on LLM capabilities that have shown reliability in narrative tasks, we will incorporate a human validation study on a subset of the novels in the revised version. This will include inter-rater agreement metrics and comparison to literary analyses where available, to confirm the psychological axes and phase boundaries. revision: yes
Referee: [Results (§4)] The claim that the Arc advantage is largest on out-of-text scenarios is load-bearing for the argument that retrieval fails where arc conditioning succeeds, yet no statistical tests, confidence intervals, number of probes per condition, or data-exclusion criteria are supplied, preventing assessment of whether the reported gaps are reliable or driven by a small number of items.

Authors: We acknowledge the need for statistical support. In the revision, we will report the number of probes per condition (noting that each of the 80 characters has multiple phases and probes), provide bootstrap confidence intervals for the performance differences, conduct appropriate statistical tests (e.g., paired t-tests), and specify any data exclusion criteria applied during evaluation. revision: yes
Referee: [Evaluation protocol] The out-of-text probes are central to showing that the Arc benefit is not merely retrieval, but the manuscript provides no explicit criteria or examples confirming these scenarios are both (a) genuinely absent from the source text and (b) still psychologically coherent with the arc phase; this ambiguity weakens the interpretation of the largest gap.

Authors: We will clarify the out-of-text probe generation process by adding explicit criteria, such as verifying absence via keyword and semantic search against the full source text, and include concrete examples in the appendix demonstrating psychological coherence with the assigned arc phase. This will make the distinction between in-text and out-of-text more transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

full rationale

The paper introduces ArcANE as an automatically constructed benchmark over 17 novels and evaluates six models across six context modes, reporting that conditioning on the Character Arc outperforms alternatives with the largest gap on out-of-text probes. No load-bearing step reduces a claimed result to its own inputs by definition, fitted parameter, or self-citation chain; the benchmark construction pipeline and the downstream model performance measurements are distinct, with results obtained via direct empirical comparison rather than any derivation that equates output to input by construction. The central claim rests on observable performance differences, not on any uniqueness theorem or ansatz imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no specific free parameters, axioms, or invented entities are identifiable or detailed in the provided text.

pith-pipeline@v0.9.1-grok · 5713 in / 1018 out tokens · 59143 ms · 2026-06-28T02:08:28.034795+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 3 canonical work pages

[1]

InFindings of the Association for Computational Linguistics: ACL 2024, pages 3291–3325, Bangkok, Thailand

TimeChara: Evaluating point-in-time charac- ter hallucination of role-playing large language mod- els. InFindings of the Association for Computational Linguistics: ACL 2024, pages 3291–3325, Bangkok, Thailand. Association for Computational Linguistics. Rick Busselle and Helena Bilandzic. 2009. Measuring narrative engagement.Media Psychology, 12(4):321– 34...

work page arXiv 2024
[2]

Preprint, arXiv:2305.16367

Role-play with large language models. Preprint, arXiv:2305.16367. Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu

work page arXiv
[3]

InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 13153–13187, Singapore

Character-LLM: A trainable agent for role- playing. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 13153–13187, Singapore. Association for Computational Linguistics. Tianhao Shen, Sun Li, Quan Tu, and Deyi Xiong. 2024. Roleeval: A bilingual role evaluation benchmark for large language models.Preprint, arX...

work page arXiv 2023
[4]

how self-assertion shifts under relational strain from prosocial attunement toward coercive control and ultimately self-negating withdrawal

as a sanity coordinate, and its absence in the induced set is a strong drift signal. Every in- duced axis must ground in established literary or psychological scholarship, following the construct- grounding criterion ofValues in the Wild(Huang et al., 2025) (Figure 6). Reconciliation tag set.Stream-unique candi- dates receive one of three tags, each tied ...

2025
[5]

PRESENCE: Is <character> present or mentioned in this chapter? If not, return minimal response
[6]

EMOTIONAL STATE: What is <character> feeling in this chapter? How does it compare to the previous chapter?
[7]

BELIEFS & VALUES: Any shifts in what <character> believes about themselves, others, society, morality?
[8]

DESIRES: What does <character> want to be true about the world or their situation? How have their desires shifted from the previous chapter?
[9]

INTENTIONS: What is <character> actively trying to bring about? What plans or goals are they pursuing?
[10]

KEY RELATIONSHIPS: How does <character> view/feel about other characters in this chapter? Any shifts?
[11]

SELF-AWARENESS: Does <character> show any new self-knowledge or remain unaware of something about themselves?
[12]

chapter": <chapter_num>,

CHANGE MAGNITUDE: Rate overall psychological change in this chapter as: none / subtle / moderate / significant / transformative Return JSON only: { "chapter": <chapter_num>, "character": "<character>", "present_in_chapter": true, "emotional_state": "Description", "belief_shifts": "Description or'none'", "current_desires": "What <character> wants to be tru...

1966
[13]

personal growth

Name it clearly and specifically (not generic like "personal growth")
[14]

Provide a short dimension_label (2-4 words) naming the psychological or relational dimension
[15]

where they end on this dimension)

Define the two poles (where the character starts vs. where they end on this dimension)
[16]

Trace the trajectory with specific chapter ranges and key events
[17]

character

Rate confidence (high/medium/low). If <character> shows little meaningful change (flat arc), state this explicitly and keep axes minimal. Return JSON only: { "character": "<character>", "arc_richness": "rich/moderate/flat", "intrapersonal_axes": [ { "axis_type": "intrapersonal", "axis_id": "intra_01", "dimension_label": "2-4 word dimension name", "axis_na...
[18]

HIGH CONFIDENCE

MATCHED: Axes that appear in both sets (possibly different names but same concept). HIGH CONFIDENCE
[19]

Assess: valid_missed / artifact / borderline

UNIQUE TO A: Axes only in Set A. Assess: valid_missed / artifact / borderline
[20]

Same assessment

UNIQUE TO B: Axes only in Set B. Same assessment
[21]

positive

MERGED FINAL SET: Your recommended final axes, combining the best from both. For each final axis, synthesize trajectory information from both sources. Every final axis MUST include: axis_type, dimension_label (2-4 words), axis_name, pole_start, pole_end, confidence, source, arc_direction, trajectory, evidence_summary. Relational axes must also include sou...
[22]

Never invent citations

Cite only real, verifiable published works. Never invent citations
[23]

Include the direct URL (JSTOR, Google Books, publisher page, Academia.edu, etc.) if you know it; otherwise set url to null

Each citation MUST include: author, title, year. Include the direct URL (JSTOR, Google Books, publisher page, Academia.edu, etc.) if you know it; otherwise set url to null
[24]

For intrapersonal axes, a work supports the axis if it discusses the same internal dimension of <character> -- even under different terminology (e.g.'pride'vs'arrogance','humility'vs'self-knowledge')
[25]

verdict": false and

If you cannot recall any real published work supporting this axis, set "verdict": false and "citations": []. Return JSON only: { "verdict": true or false, "reasoning": "2-3 sentences from your critical perspective", "citations": [ { "author": "Surname, Firstname (or just Surname)", "title": "Full work title", "year": 1990, "publication": "Journal name / b...

1990
[26]

State it as a BINARY behavioral contrast that the character would be answering THROUGH ACTION

Identify the SINGLE decision variable that captures what flips between phases. State it as a BINARY behavioral contrast that the character would be answering THROUGH ACTION. It must: - be answerable through behavior/choice - be specific enough that different phases give different answers - be phrased neutrally (no preferred pole) - be ONE dimension with t...
[27]

Produce EXACTLY (N-1) entries -- one per adjacent pair, in order -- with 0-indexed from/to indices

For each adjacent phase pair, describe in one sentence how the decision-variable answer shifts. Produce EXACTLY (N-1) entries -- one per adjacent pair, in order -- with 0-indexed from/to indices. Return JSON: {decision_variable, phase_contrasts: [{from_phase_idx, to_phase_idx, contrast}]}. Table 20: Life-stage tagger. [System] You are a literary-age analy...
[28]

agency vs communion under pressure

abstract_axis: 1 sentence naming the abstract dimension (e.g. "agency vs communion under pressure"). No novel-specific terminology
[29]

", "

abstract_phases: EXACTLY N entries, one per trajectory phase in order. Each is a short phrase summarizing that phase's position on the abstract axis in era-agnostic terms. Return JSON: {abstract_axis, abstract_phases: ["...", "..."]}. 34 Table 22: In-Scenario designer (probe type 1). The user message also inlines the shared phase-response instructions (Ta...
[30]

Stop just before <character> acts/speaks

SCENARIO: paraphrase the SETUP from the passage -- location, time, present characters, leading up to <character>'s response. Stop just before <character> acts/speaks
[31]

QUESTION: name the specific choice point (see QUESTION STYLE)
[32]

Non-anchor phases are counterfactual projections

<Phase-response instructions (shared sub-block).> ANCHOR phase entry (phase_idx=<k>) MUST extract <character>'s actual response from the SOURCE PASSAGE -- gt_action paraphrases the passage; gt_speech copies <character>'s spoken words verbatim from the passage (or null if they don't speak); gt_thought captures the construal evident in the passage. Non-anch...
[33]

End just before <character> responds

SCENARIO: a plausible-but-unwritten situation around chapter <qc>. End just before <character> responds. Keep it phase-agnostic
[35]

36 Table 24: Out-of-World designer (probe type 3)

<Phase-response instructions (shared sub-block).> <Style guidelines (shared sub-block).> Return JSON: {scenario, question, phase_responses: [...]}. 36 Table 24: Out-of-World designer (probe type 3). The user message also inlines the shared phase-response instructions (Table 18) and style guidelines (Table 17). [System] You are a narrative psychologist tra...
[36]

Set the tension in 1-2 sentences (<=60 words)

SCENARIO: a brief age-agnostic situation in the target era centered on <character>. Set the tension in 1-2 sentences (<=60 words). End just before <character> responds
[37]

QUESTION: pose the choice point (see QUESTION STYLE)
[38]

<key_moment>

<Phase-response instructions (shared sub-block).> <Style guidelines (shared sub-block).> Return JSON: {scenario, question, phase_responses: [...]}. 37 Table 25: In-Scenario scene grounding (locator + extractor). Two calls, executed sequentially per (axis, anchor phase). [Locator -- System] You are an editorial archivist for literary text. Given a brief mo...
[39]

Recall the scene from the scenario and describe it using the six Ws (Who, What, When, Where, Why, How)
[40]

unknown". First, reason step by step. Then on its own final line, output ONLY

Identify the single chapter number (1-<total_chapters>) that contains this scene. If you cannot determine it, output "unknown". First, reason step by step. Then on its own final line, output ONLY "Chapter: N" (an integer) or "Chapter: unknown". [Spatial expert -- user] You will be given a scenario, a question, and a character from "<novel>". Your task is ...
[41]

Recall the scene and list every character involved, including those present but unmentioned
[42]

Presence: present

Decide whether <character> is among them. First, reason step by step. Then on its own final line, output ONLY "Presence: present" or "Presence: absent". [Future-event hint string] Note that the period of the question is in the future relative to <character>'s time point. Therefore, you should not answer the question or mention any facts that occurred afte...
[43]

APF -- Action Phase-Fidelity. Is the response's action mechanism- equivalent to ref_action at the target phase? Judge across three levels (Level A is strictest; B and C can partially salvage credit when A fails): - Level A -- Strategy match: same underlying strategy (e.g. withdrawal, confrontation, deflection, mediation, concealment, disclosure, support-s...
[44]

RPF -- Reasoning Phase-Fidelity. Parse both ref_thought and the response's reasoning into four mechanism slots and judge each slot: - Trigger: what event / utterance / internal state set this response in motion? - Appraisal: how does the character interpret / evaluate that trigger? - Goal: what short-term goal arises from the appraisal? - Strategy: how is...
[45]

avoid the threat

RAE -- Reasoning-Action Entailment. Given the *reference reasoning (ref_thought)* as a fixed anchor, judge whether the response's action is a plausible action that this reasoning would license at this phase. This dimension is *conditioned on ref_thought*, not on the response's own reasoning -- a response can have internally self-consistent reasoning that ...
[46]

ptf_alignment -- Per-phase anchoring. For each phase i, is the model's response at phase i closer in mechanism to the reference at phase i than to references at other phases? A high score means most responses are correctly anchored to their own phase; a low score means responses systematically drift toward the wrong phase or collapse into one undifferenti...
[47]

Treating the reference sequence as defining a direction (e.g

ptf_direction -- Direction of change. Treating the reference sequence as defining a direction (e.g. fearful -> confident, dependent -> autonomous), does the model's sequence move along the same axis and in the same direction? A high score means the model's overall direction-of-travel matches the reference's; a low score means the model moves on a differen...
[48]

scores": {

ptf_shape -- Shape and pacing of change. Beyond direction, does the model reproduce where on the trajectory the largest shifts occur and how gradual or abrupt those shifts are? A high score means inflection points and pacing line up with the reference; a low score means the model's trajectory has a clearly different internal structure (e.g. linear ramp wh...

[1] [1]

InFindings of the Association for Computational Linguistics: ACL 2024, pages 3291–3325, Bangkok, Thailand

TimeChara: Evaluating point-in-time charac- ter hallucination of role-playing large language mod- els. InFindings of the Association for Computational Linguistics: ACL 2024, pages 3291–3325, Bangkok, Thailand. Association for Computational Linguistics. Rick Busselle and Helena Bilandzic. 2009. Measuring narrative engagement.Media Psychology, 12(4):321– 34...

work page arXiv 2024

[2] [2]

Preprint, arXiv:2305.16367

Role-play with large language models. Preprint, arXiv:2305.16367. Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu

work page arXiv

[3] [3]

InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 13153–13187, Singapore

Character-LLM: A trainable agent for role- playing. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 13153–13187, Singapore. Association for Computational Linguistics. Tianhao Shen, Sun Li, Quan Tu, and Deyi Xiong. 2024. Roleeval: A bilingual role evaluation benchmark for large language models.Preprint, arX...

work page arXiv 2023

[4] [4]

how self-assertion shifts under relational strain from prosocial attunement toward coercive control and ultimately self-negating withdrawal

as a sanity coordinate, and its absence in the induced set is a strong drift signal. Every in- duced axis must ground in established literary or psychological scholarship, following the construct- grounding criterion ofValues in the Wild(Huang et al., 2025) (Figure 6). Reconciliation tag set.Stream-unique candi- dates receive one of three tags, each tied ...

2025

[5] [5]

PRESENCE: Is <character> present or mentioned in this chapter? If not, return minimal response

[6] [6]

EMOTIONAL STATE: What is <character> feeling in this chapter? How does it compare to the previous chapter?

[7] [7]

BELIEFS & VALUES: Any shifts in what <character> believes about themselves, others, society, morality?

[8] [8]

DESIRES: What does <character> want to be true about the world or their situation? How have their desires shifted from the previous chapter?

[9] [9]

INTENTIONS: What is <character> actively trying to bring about? What plans or goals are they pursuing?

[10] [10]

KEY RELATIONSHIPS: How does <character> view/feel about other characters in this chapter? Any shifts?

[11] [11]

SELF-AWARENESS: Does <character> show any new self-knowledge or remain unaware of something about themselves?

[12] [12]

chapter": <chapter_num>,

CHANGE MAGNITUDE: Rate overall psychological change in this chapter as: none / subtle / moderate / significant / transformative Return JSON only: { "chapter": <chapter_num>, "character": "<character>", "present_in_chapter": true, "emotional_state": "Description", "belief_shifts": "Description or'none'", "current_desires": "What <character> wants to be tru...

1966

[13] [13]

personal growth

Name it clearly and specifically (not generic like "personal growth")

[14] [14]

Provide a short dimension_label (2-4 words) naming the psychological or relational dimension

[15] [15]

where they end on this dimension)

Define the two poles (where the character starts vs. where they end on this dimension)

[16] [16]

Trace the trajectory with specific chapter ranges and key events

[17] [17]

character

Rate confidence (high/medium/low). If <character> shows little meaningful change (flat arc), state this explicitly and keep axes minimal. Return JSON only: { "character": "<character>", "arc_richness": "rich/moderate/flat", "intrapersonal_axes": [ { "axis_type": "intrapersonal", "axis_id": "intra_01", "dimension_label": "2-4 word dimension name", "axis_na...

[18] [18]

HIGH CONFIDENCE

MATCHED: Axes that appear in both sets (possibly different names but same concept). HIGH CONFIDENCE

[19] [19]

Assess: valid_missed / artifact / borderline

UNIQUE TO A: Axes only in Set A. Assess: valid_missed / artifact / borderline

[20] [20]

Same assessment

UNIQUE TO B: Axes only in Set B. Same assessment

[21] [21]

positive

MERGED FINAL SET: Your recommended final axes, combining the best from both. For each final axis, synthesize trajectory information from both sources. Every final axis MUST include: axis_type, dimension_label (2-4 words), axis_name, pole_start, pole_end, confidence, source, arc_direction, trajectory, evidence_summary. Relational axes must also include sou...

[22] [22]

Never invent citations

Cite only real, verifiable published works. Never invent citations

[23] [23]

Include the direct URL (JSTOR, Google Books, publisher page, Academia.edu, etc.) if you know it; otherwise set url to null

Each citation MUST include: author, title, year. Include the direct URL (JSTOR, Google Books, publisher page, Academia.edu, etc.) if you know it; otherwise set url to null

[24] [24]

For intrapersonal axes, a work supports the axis if it discusses the same internal dimension of <character> -- even under different terminology (e.g.'pride'vs'arrogance','humility'vs'self-knowledge')

[25] [25]

verdict": false and

If you cannot recall any real published work supporting this axis, set "verdict": false and "citations": []. Return JSON only: { "verdict": true or false, "reasoning": "2-3 sentences from your critical perspective", "citations": [ { "author": "Surname, Firstname (or just Surname)", "title": "Full work title", "year": 1990, "publication": "Journal name / b...

1990

[26] [26]

State it as a BINARY behavioral contrast that the character would be answering THROUGH ACTION

Identify the SINGLE decision variable that captures what flips between phases. State it as a BINARY behavioral contrast that the character would be answering THROUGH ACTION. It must: - be answerable through behavior/choice - be specific enough that different phases give different answers - be phrased neutrally (no preferred pole) - be ONE dimension with t...

[27] [27]

Produce EXACTLY (N-1) entries -- one per adjacent pair, in order -- with 0-indexed from/to indices

For each adjacent phase pair, describe in one sentence how the decision-variable answer shifts. Produce EXACTLY (N-1) entries -- one per adjacent pair, in order -- with 0-indexed from/to indices. Return JSON: {decision_variable, phase_contrasts: [{from_phase_idx, to_phase_idx, contrast}]}. Table 20: Life-stage tagger. [System] You are a literary-age analy...

[28] [28]

agency vs communion under pressure

abstract_axis: 1 sentence naming the abstract dimension (e.g. "agency vs communion under pressure"). No novel-specific terminology

[29] [29]

", "

abstract_phases: EXACTLY N entries, one per trajectory phase in order. Each is a short phrase summarizing that phase's position on the abstract axis in era-agnostic terms. Return JSON: {abstract_axis, abstract_phases: ["...", "..."]}. 34 Table 22: In-Scenario designer (probe type 1). The user message also inlines the shared phase-response instructions (Ta...

[30] [30]

Stop just before <character> acts/speaks

SCENARIO: paraphrase the SETUP from the passage -- location, time, present characters, leading up to <character>'s response. Stop just before <character> acts/speaks

[31] [31]

QUESTION: name the specific choice point (see QUESTION STYLE)

[32] [32]

Non-anchor phases are counterfactual projections

<Phase-response instructions (shared sub-block).> ANCHOR phase entry (phase_idx=<k>) MUST extract <character>'s actual response from the SOURCE PASSAGE -- gt_action paraphrases the passage; gt_speech copies <character>'s spoken words verbatim from the passage (or null if they don't speak); gt_thought captures the construal evident in the passage. Non-anch...

[33] [33]

End just before <character> responds

SCENARIO: a plausible-but-unwritten situation around chapter <qc>. End just before <character> responds. Keep it phase-agnostic

[34] [35]

36 Table 24: Out-of-World designer (probe type 3)

<Phase-response instructions (shared sub-block).> <Style guidelines (shared sub-block).> Return JSON: {scenario, question, phase_responses: [...]}. 36 Table 24: Out-of-World designer (probe type 3). The user message also inlines the shared phase-response instructions (Table 18) and style guidelines (Table 17). [System] You are a narrative psychologist tra...

[35] [36]

Set the tension in 1-2 sentences (<=60 words)

SCENARIO: a brief age-agnostic situation in the target era centered on <character>. Set the tension in 1-2 sentences (<=60 words). End just before <character> responds

[36] [37]

QUESTION: pose the choice point (see QUESTION STYLE)

[37] [38]

<key_moment>

<Phase-response instructions (shared sub-block).> <Style guidelines (shared sub-block).> Return JSON: {scenario, question, phase_responses: [...]}. 37 Table 25: In-Scenario scene grounding (locator + extractor). Two calls, executed sequentially per (axis, anchor phase). [Locator -- System] You are an editorial archivist for literary text. Given a brief mo...

[38] [39]

Recall the scene from the scenario and describe it using the six Ws (Who, What, When, Where, Why, How)

[39] [40]

unknown". First, reason step by step. Then on its own final line, output ONLY

Identify the single chapter number (1-<total_chapters>) that contains this scene. If you cannot determine it, output "unknown". First, reason step by step. Then on its own final line, output ONLY "Chapter: N" (an integer) or "Chapter: unknown". [Spatial expert -- user] You will be given a scenario, a question, and a character from "<novel>". Your task is ...

[40] [41]

Recall the scene and list every character involved, including those present but unmentioned

[41] [42]

Presence: present

Decide whether <character> is among them. First, reason step by step. Then on its own final line, output ONLY "Presence: present" or "Presence: absent". [Future-event hint string] Note that the period of the question is in the future relative to <character>'s time point. Therefore, you should not answer the question or mention any facts that occurred afte...

[42] [43]

APF -- Action Phase-Fidelity. Is the response's action mechanism- equivalent to ref_action at the target phase? Judge across three levels (Level A is strictest; B and C can partially salvage credit when A fails): - Level A -- Strategy match: same underlying strategy (e.g. withdrawal, confrontation, deflection, mediation, concealment, disclosure, support-s...

[43] [44]

RPF -- Reasoning Phase-Fidelity. Parse both ref_thought and the response's reasoning into four mechanism slots and judge each slot: - Trigger: what event / utterance / internal state set this response in motion? - Appraisal: how does the character interpret / evaluate that trigger? - Goal: what short-term goal arises from the appraisal? - Strategy: how is...

[44] [45]

avoid the threat

RAE -- Reasoning-Action Entailment. Given the *reference reasoning (ref_thought)* as a fixed anchor, judge whether the response's action is a plausible action that this reasoning would license at this phase. This dimension is *conditioned on ref_thought*, not on the response's own reasoning -- a response can have internally self-consistent reasoning that ...

[45] [46]

ptf_alignment -- Per-phase anchoring. For each phase i, is the model's response at phase i closer in mechanism to the reference at phase i than to references at other phases? A high score means most responses are correctly anchored to their own phase; a low score means responses systematically drift toward the wrong phase or collapse into one undifferenti...

[46] [47]

Treating the reference sequence as defining a direction (e.g

ptf_direction -- Direction of change. Treating the reference sequence as defining a direction (e.g. fearful -> confident, dependent -> autonomous), does the model's sequence move along the same axis and in the same direction? A high score means the model's overall direction-of-travel matches the reference's; a low score means the model moves on a differen...

[47] [48]

scores": {

ptf_shape -- Shape and pacing of change. Beyond direction, does the model reproduce where on the trajectory the largest shifts occur and how gradual or abrupt those shifts are? A high score means inflection points and pacing line up with the reference; a low score means the model's trajectory has a clearly different internal structure (e.g. linear ramp wh...