pith. machine review for the scientific record.

arxiv: 2605.08742 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link · Lean Theorem

Narrative Landscape: Mapping Narrative Dispositions Across LLMs


Pith reviewed 2026-05-12 03:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM dispositions · narrative selection · consistency · diversity · PCA visualization · model comparison · selection patterns

The pith

Large language models show stable selection patterns that form a rigidity-exploration spectrum in narrative tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to measure how different LLMs behave when repeatedly asked to pick among narrative options under fixed constraints. It tracks two traits for each model: how consistently it repeats the same picks across trials and how widely it spreads picks among available options. These traits are plotted together in a shared visual space so models can be compared directly. The plots place model families along a line from rigid, low-variation choices to more exploratory, high-variation ones. The results also show that rephrasing the instructions can change the shape of a model's choice pattern even when its average scores stay the same.

Core claim

Using a structured narrative constraint-selection task administered across six frontier models and three instruction types, we operationalize disposition through two dimensions: consistency, measured as cross-replication selection overlap via Jaccard similarity, and diversity, measured as dispersion across options via the inverse Simpson index. We further introduce Narrative Landscape, a PCA-based visualization that maps each model's selection profile into a shared space for direct comparison. Results reveal a clear rigidity-exploration spectrum across model families and show that instruction types shift the geometry of selection spaces even when scalar metrics appear similar.
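Both metrics are simple to compute from raw selection data. A minimal sketch in Python, assuming consistency is aggregated as the mean Jaccard overlap over replication pairs and diversity is computed over pooled selections (the paper's exact aggregation is not stated on this page; the data are hypothetical):

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two selection sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def mean_pairwise_jaccard(replications):
    """Consistency: mean Jaccard overlap over all pairs of replications."""
    pairs = [(i, j) for i in range(len(replications))
             for j in range(i + 1, len(replications))]
    return sum(jaccard(replications[i], replications[j])
               for i, j in pairs) / len(pairs)

def inverse_simpson(selections):
    """Diversity: 1 / sum(p_k^2), the effective number of options used."""
    counts = Counter(selections)
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())

# Three hypothetical replications of one constrained selection task.
reps = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "D"}]
consistency = mean_pairwise_jaccard(reps)                  # 0.5
diversity = inverse_simpson([o for r in reps for o in r])  # 81/21 ≈ 3.86
```

A rigid model would score near 1.0 on consistency and near 1.0 on diversity (always the same single option); an exploratory one scores low on the first and high on the second.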

What carries the argument

Narrative Landscape, the PCA-based visualization that places each model's measured consistency and diversity values into a shared two-dimensional space for side-by-side comparison of selection topologies.
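The paper applies PCA to each model's full selection profile; as an illustration of the mapping, here is a minimal sketch that projects only the two summary scores for four hypothetical models, using the closed-form principal axis of a 2x2 covariance matrix (model names and values are invented, not the paper's results):

```python
import math

def pca_2d(points):
    """Closed-form PCA for two features: center the data, then rotate it
    onto the eigenvectors of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]
    ys = [p[1] - my for p in points]
    sxx = sum(v * v for v in xs) / n
    syy = sum(v * v for v in ys) / n
    sxy = sum(a * b for a, b in zip(xs, ys)) / n
    # Angle of the first principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * y, -s * x + c * y) for x, y in zip(xs, ys)]

# Hypothetical (consistency, diversity) profiles for four models.
profiles = {"model_a": (0.9, 1.2), "model_b": (0.7, 2.1),
            "model_c": (0.4, 3.0), "model_d": (0.3, 3.4)}
coords = pca_2d(list(profiles.values()))
# The first coordinate orders models along the rigidity-exploration axis.
```

In the paper's version the input vectors are presumably higher-dimensional selection profiles rather than these two scalars, but the geometry is the same: models that select similarly land near each other in the shared space.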

If this is right

  • Comparable average scores can conceal qualitatively different underlying selection topologies across models.
  • Instruction wording can alter the geometry of choices without changing scalar performance numbers.
  • Model families occupy distinct locations on a spectrum from high-consistency low-diversity to the opposite pattern.
  • The shared visualization allows direct comparison of how different models handle repeated constrained decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same profiling approach could be tested on decision tasks outside narrative contexts to check whether the observed spectrum is general.
  • If the spectrum remains stable across model updates, it might function as a persistent signature for comparing successive versions of the same model.
  • Applications that need predictable outputs versus creative variety could use these profiles to select models instead of relying on single benchmark scores.

Load-bearing premise

The structured narrative constraint-selection task together with the Jaccard and inverse Simpson metrics capture stable, model-specific dispositions rather than depending on the particular prompts or option lists chosen for the experiment.

What would settle it

Running the same protocol on entirely new narrative scenarios and option sets and finding that models no longer occupy the same relative positions along the rigidity-exploration spectrum would show the patterns are not stable dispositions.
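That stability test reduces to asking whether the models' relative ordering correlates across scenario sets. A minimal sketch, assuming Spearman rank correlation over per-model consistency scores (all values hypothetical):

```python
def rank(values):
    """Average-rank transform (ties share the mean of their rank span)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two score lists."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical consistency scores for six models on two scenario sets.
set_a = [0.9, 0.7, 0.5, 0.3, 0.6, 0.8]
set_b = [0.85, 0.65, 0.55, 0.35, 0.60, 0.75]
rho = spearman(set_a, set_b)  # 1.0 here: the rank order is identical
```

A rho near 1 across fresh scenarios would support the disposition claim; a rho near 0 would suggest the spectrum is an artifact of the original prompts and option lists.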

Figures

Figures reproduced from arXiv: 2605.08742 by Donghoon Jung, Jiwoo Choi, Seohyon Jung, Songeun Chae.

Figure 1. Narrative Landscape of six models (figures/full_fig_p003_1.png).
Figure 2. Narrative Landscape of three instruction types in gpt5 (figures/full_fig_p004_2.png).
original abstract

This study proposes a quantitative framework for profiling LLM dispositions as stable, model-specific regularities in output under repeated, controlled elicitation. Using a structured narrative constraint-selection task administered across six frontier models and three instruction types, we operationalize disposition through two dimensions: "consistency", measured as cross-replication selection overlap via Jaccard similarity, and "diversity", measured as dispersion across options via the inverse Simpson index. We further introduce Narrative Landscape, a PCA-based visualization that maps each model's selection profile into a shared space for direct comparison. Results reveal a clear rigidity-exploration spectrum across model families and show that instruction types shift the geometry of selection spaces even when scalar metrics appear similar, indicating that comparable scores can mask qualitatively distinct selection topologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a quantitative framework for profiling LLM narrative dispositions as stable regularities under repeated elicitation. It uses a structured constraint-selection task across six frontier models and three instruction types, operationalizing disposition via consistency (Jaccard similarity across replications) and diversity (inverse Simpson index over options). A PCA-based 'Narrative Landscape' visualization maps selection profiles for comparison. Key results claim a rigidity-exploration spectrum across model families and that instruction types shift selection-space geometry even when scalar metrics are comparable.

Significance. If validated, the work provides a useful comparative tool for visualizing LLM behavioral differences and demonstrates that scalar metrics can obscure topological distinctions in selection patterns. The controlled multi-model, multi-instruction design and post-hoc PCA mapping are clear strengths for enabling direct cross-model inspection. However, the absence of statistical validation and robustness checks limits the strength of the spectrum and geometry-shift claims.

major comments (3)
  1. [Methods] Methods: No sample sizes for replications per prompt, no size of the narrative option set, and no data-handling details (e.g., how ties or missing selections are treated) are reported. These omissions are load-bearing for the central claims, as both Jaccard consistency and inverse Simpson diversity are directly computed from the selection data; without them the reliability of the rigidity-exploration spectrum cannot be assessed.
  2. [Results] Results: The claims of a 'clear rigidity-exploration spectrum' and instruction-induced geometry shifts rest on PCA visualization and scalar metric comparisons without any statistical tests (e.g., permutation tests for cluster separation, ANOVA on Jaccard/Simpson values, or bootstrap confidence intervals). This makes it impossible to determine whether observed differences exceed what would be expected from sampling variability alone.
  3. [Discussion] Discussion / robustness: No checks are described for sensitivity to prompt paraphrases, alternative option sets, or lexical biases in the narrative constraints. Because the skeptic concern is that Jaccard and inverse Simpson may be dominated by surface features of the elicitation rather than intrinsic model dispositions, the lack of such tests directly undermines generalizability of the spectrum and topology-shift findings.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'Narrative Landscape' is introduced as a PCA visualization but without a one-sentence definition of what the axes or points represent, which would help readers immediately grasp the mapping.
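The bootstrap confidence intervals requested in comment 2 could look like the following minimal sketch: a percentile bootstrap over hypothetical pairwise Jaccard values (data, defaults, and function names are illustrative, not from the paper):

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of a metric."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(values) for _ in values]
        means.append(sum(resample) / len(resample))
    means.sort()
    return (means[int(alpha / 2 * n_boot)],
            means[int((1 - alpha / 2) * n_boot) - 1])

# Hypothetical pairwise Jaccard values for one model's replications.
jaccards = [0.50, 0.45, 0.55, 0.60, 0.40, 0.52, 0.48, 0.58, 0.44, 0.56]
lo, hi = bootstrap_ci(jaccards)
# If the intervals for two models do not overlap, their consistency
# scores plausibly differ beyond sampling variability.
```

The same resampling machinery extends naturally to the permutation tests for cluster separation the referee asks for: shuffle model labels, recompute the between-cluster distance in PCA space, and compare against the observed value.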

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas for strengthening the methodological transparency, statistical support, and robustness of our claims. We address each point below and commit to revisions that directly respond to the concerns raised.

point-by-point responses
  1. Referee: [Methods] Methods: No sample sizes for replications per prompt, no size of the narrative option set, and no data-handling details (e.g., how ties or missing selections are treated) are reported. These omissions are load-bearing for the central claims, as both Jaccard consistency and inverse Simpson diversity are directly computed from the selection data; without them the reliability of the rigidity-exploration spectrum cannot be assessed.

    Authors: We agree that these details are essential for reproducibility and for evaluating the reliability of the Jaccard and inverse Simpson metrics. The original submission omitted explicit reporting of these parameters. In the revised manuscript we will add a dedicated subsection in Methods titled 'Data Collection and Preprocessing' that specifies: the number of replications per prompt, the exact size of the narrative option set for each constraint, and the procedures for handling ties (random tie-breaking) and missing selections (none occurred). We will also move the full prompt templates and option lists to an appendix. These additions will allow readers to directly assess the stability of the reported rigidity-exploration spectrum. revision: yes

  2. Referee: [Results] Results: The claims of a 'clear rigidity-exploration spectrum' and instruction-induced geometry shifts rest on PCA visualization and scalar metric comparisons without any statistical tests (e.g., permutation tests for cluster separation, ANOVA on Jaccard/Simpson values, or bootstrap confidence intervals). This makes it impossible to determine whether observed differences exceed what would be expected from sampling variability alone.

    Authors: The referee is correct that the current results section relies on descriptive visualization and scalar comparisons without formal statistical validation. While the PCA-based Narrative Landscape was intended as an exploratory comparative tool, we acknowledge that this leaves the spectrum and geometry-shift claims vulnerable to the sampling-variability concern. In the revision we will add permutation tests for assessing separation of model clusters in the PCA space, bootstrap confidence intervals around the Jaccard and inverse Simpson values, and appropriate ANOVA or non-parametric tests for differences across models and instruction types. These quantitative tests will be reported alongside the existing visualizations. revision: yes

  3. Referee: [Discussion] Discussion / robustness: No checks are described for sensitivity to prompt paraphrases, alternative option sets, or lexical biases in the narrative constraints. Because the skeptic concern is that Jaccard and inverse Simpson may be dominated by surface features of the elicitation rather than intrinsic model dispositions, the lack of such tests directly undermines generalizability of the spectrum and topology-shift findings.

    Authors: We accept that the absence of explicit robustness checks limits the strength of the generalizability claims. The original manuscript did not include sensitivity analyses. In the revised version we will add a new 'Robustness and Sensitivity Analyses' subsection that reports: (i) results obtained with paraphrased versions of the core prompts on a representative subset of models, (ii) experiments using both expanded and contracted narrative option sets, and (iii) an examination of potential lexical biases by comparing selections across semantically matched but lexically varied constraints. These additional experiments will be used to qualify the stability of the reported spectrum and geometry shifts. revision: yes

Circularity Check

0 steps flagged

No circularity: standard metrics and post-hoc PCA applied directly to task outputs

full rationale

The paper operationalizes consistency as Jaccard similarity across replications and diversity as inverse Simpson index over selected options, then applies PCA to the resulting profiles for visualization. These steps use established, non-parametric formulas computed from the raw selection data without any fitting, parameter estimation, or renaming that reduces the claimed spectrum or geometry shifts to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes, and the derivation chain remains self-contained against external benchmarks. The observed rigidity-exploration spectrum follows from the computed values rather than being presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 1 invented entity

The central claim rests on the assumption that the chosen task and metrics reflect stable dispositions, with PCA providing a meaningful shared space; no free parameters are fitted to produce the spectrum, and no new physical or theoretical entities are postulated.

axioms (3)
  • domain assumption Jaccard similarity on selection sets appropriately quantifies consistency across replications
    Invoked to operationalize the consistency dimension in the abstract.
  • domain assumption Inverse Simpson index on option dispersion appropriately quantifies diversity
    Invoked to operationalize the diversity dimension.
  • domain assumption PCA on model selection profiles yields interpretable axes corresponding to rigidity-exploration
    Used to create the Narrative Landscape visualization and interpret the spectrum.
invented entities (1)
  • Narrative Landscape no independent evidence
    purpose: Shared PCA space for comparing model selection profiles
    New visualization construct introduced to map models; no independent falsifiable prediction beyond the current data.

pith-pipeline@v0.9.0 · 5423 in / 1483 out tokens · 38250 ms · 2026-05-12T03:38:01.846808+00:00 · methodology

