Narrative Landscape: Mapping Narrative Dispositions Across LLMs
Pith reviewed 2026-05-12 03:38 UTC · model grok-4.3
The pith
Large language models show stable selection patterns that form a rigidity-exploration spectrum in narrative tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a structured narrative constraint-selection task administered across six frontier models and three instruction types, we operationalize disposition through two dimensions: consistency, measured as cross-replication selection overlap via Jaccard similarity, and diversity, measured as dispersion across options via the inverse Simpson index. We further introduce Narrative Landscape, a PCA-based visualization that maps each model's selection profile into a shared space for direct comparison. Results reveal a clear rigidity-exploration spectrum across model families and show that instruction types shift the geometry of selection spaces even when scalar metrics appear similar.
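Both metrics are standard and compact enough to sketch. A minimal Python illustration on invented toy selections (the option labels and replication counts are illustrative, not the paper's data):

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two selection sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def mean_pairwise_jaccard(replications):
    """Consistency: mean Jaccard similarity over all pairs of replications."""
    pairs = [
        (replications[i], replications[j])
        for i in range(len(replications))
        for j in range(i + 1, len(replications))
    ]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def inverse_simpson(selections):
    """Diversity: inverse Simpson index 1 / sum_i p_i^2 over option shares."""
    counts = Counter(selections)
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())

# Invented toy data: three replications of one prompt, options labeled A-D.
reps = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "D"}]
consistency = mean_pairwise_jaccard(reps)                  # 0.5 here
diversity = inverse_simpson([o for r in reps for o in r])  # ≈ 3.86 here
```

On real data, consistency would be computed per prompt and then aggregated, and diversity taken over the full selection distribution; the toy numbers only fix the formulas.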
What carries the argument
Narrative Landscape, the PCA-based visualization that places each model's measured consistency and diversity values into a shared two-dimensional space for side-by-side comparison of selection topologies.
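A PCA landscape of this kind can be sketched in a few lines. The profile matrix below is invented for illustration (here, consistency and diversity under two instruction types per model); it is not the paper's data:

```python
import numpy as np

# Hypothetical selection profiles: one row per model, columns are
# (consistency, diversity) under two instruction types. Invented values.
profiles = np.array([
    [0.92, 1.4, 0.88, 1.6],  # rigid: high consistency, low diversity
    [0.90, 1.5, 0.85, 1.7],
    [0.45, 3.8, 0.50, 3.5],  # exploratory: low consistency, high diversity
    [0.40, 4.0, 0.48, 3.6],
])

# PCA via SVD of the mean-centered matrix; the "landscape" is each model's
# projection onto the top two principal components of the shared space.
centered = profiles - profiles.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
landscape = centered @ vt[:2].T  # shape: (n_models, 2)
```

Models with similar selection behavior land near each other, so a rigidity-exploration spectrum shows up as separation along the leading component.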
If this is right
- Comparable average scores can conceal qualitatively different underlying selection topologies across models.
- Instruction wording can alter the geometry of choices without changing scalar performance numbers.
- Model families occupy distinct locations on a spectrum from high-consistency low-diversity to the opposite pattern.
- The shared visualization allows direct comparison of how different models handle repeated constrained decisions.
Where Pith is reading between the lines
- The same profiling approach could be tested on decision tasks outside narrative contexts to check whether the observed spectrum is general.
- If the spectrum remains stable across model updates, it might function as a persistent signature for comparing successive versions of the same model.
- Applications that need predictable outputs versus creative variety could use these profiles to select models instead of relying on single benchmark scores.
Load-bearing premise
The structured narrative constraint-selection task together with the Jaccard and inverse Simpson metrics capture stable, model-specific dispositions rather than depending on the particular prompts or option lists chosen for the experiment.
What would settle it
Running the same protocol on entirely new narrative scenarios and option sets would settle it: if models retain their relative positions along the rigidity-exploration spectrum, the patterns are stable dispositions; if they do not, the spectrum is an artifact of the particular prompts and option lists used in the experiment.
Original abstract
This study proposes a quantitative framework for profiling LLM dispositions as stable, model-specific regularities in output under repeated, controlled elicitation. Using a structured narrative constraint-selection task administered across six frontier models and three instruction types, we operationalize disposition through two dimensions: "consistency", measured as cross-replication selection overlap via Jaccard similarity, and "diversity", measured as dispersion across options via the inverse Simpson index. We further introduce Narrative Landscape, a PCA-based visualization that maps each model's selection profile into a shared space for direct comparison. Results reveal a clear rigidity-exploration spectrum across model families and show that instruction types shift the geometry of selection spaces even when scalar metrics appear similar, indicating that comparable scores can mask qualitatively distinct selection topologies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a quantitative framework for profiling LLM narrative dispositions as stable regularities under repeated elicitation. It uses a structured constraint-selection task across six frontier models and three instruction types, operationalizing disposition via consistency (Jaccard similarity across replications) and diversity (inverse Simpson index over options). A PCA-based 'Narrative Landscape' visualization maps selection profiles for comparison. Key results claim a rigidity-exploration spectrum across model families and that instruction types shift selection-space geometry even when scalar metrics are comparable.
Significance. If validated, the work provides a useful comparative tool for visualizing LLM behavioral differences and demonstrates that scalar metrics can obscure topological distinctions in selection patterns. The controlled multi-model, multi-instruction design and post-hoc PCA mapping are clear strengths for enabling direct cross-model inspection. However, the absence of statistical validation and robustness checks limits the strength of the spectrum and geometry-shift claims.
major comments (3)
- [Methods] The number of replications per prompt, the size of the narrative option set, and data-handling details (e.g., treatment of ties or missing selections) are not reported. These omissions are load-bearing for the central claims: both Jaccard consistency and inverse Simpson diversity are computed directly from the selection data, so without these details the reliability of the rigidity-exploration spectrum cannot be assessed.
- [Results] The claims of a 'clear rigidity-exploration spectrum' and of instruction-induced geometry shifts rest on PCA visualization and scalar metric comparisons without statistical tests (e.g., permutation tests for cluster separation, ANOVA on Jaccard/Simpson values, or bootstrap confidence intervals). It is therefore impossible to determine whether the observed differences exceed what sampling variability alone would produce.
- [Discussion] No robustness checks are described for sensitivity to prompt paraphrases, alternative option sets, or lexical biases in the narrative constraints. Since the skeptical concern is that Jaccard and inverse Simpson may be dominated by surface features of the elicitation rather than by intrinsic model dispositions, the absence of such tests directly undermines the generalizability of the spectrum and topology-shift findings.
minor comments (1)
- [Abstract] 'Narrative Landscape' is introduced as a PCA visualization without a one-sentence definition of what its axes and points represent; adding one would help readers immediately grasp the mapping.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments highlight important areas for strengthening the methodological transparency, statistical support, and robustness of our claims. We address each point below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee [Methods]: The number of replications per prompt, the size of the narrative option set, and data-handling details (e.g., treatment of ties or missing selections) are not reported. These omissions are load-bearing for the central claims: both Jaccard consistency and inverse Simpson diversity are computed directly from the selection data, so without these details the reliability of the rigidity-exploration spectrum cannot be assessed.
Authors: We agree that these details are essential for reproducibility and for evaluating the reliability of the Jaccard and inverse Simpson metrics. The original submission omitted explicit reporting of these parameters. In the revised manuscript we will add a dedicated subsection in Methods titled 'Data Collection and Preprocessing' that specifies: the number of replications per prompt, the exact size of the narrative option set for each constraint, and the procedures for handling ties (random tie-breaking) and missing selections (none occurred). We will also move the full prompt templates and option lists to an appendix. These additions will allow readers to directly assess the stability of the reported rigidity-exploration spectrum. revision: yes
- Referee [Results]: The claims of a 'clear rigidity-exploration spectrum' and of instruction-induced geometry shifts rest on PCA visualization and scalar metric comparisons without statistical tests (e.g., permutation tests for cluster separation, ANOVA on Jaccard/Simpson values, or bootstrap confidence intervals). It is therefore impossible to determine whether the observed differences exceed what sampling variability alone would produce.
Authors: The referee is correct that the current results section relies on descriptive visualization and scalar comparisons without formal statistical validation. While the PCA-based Narrative Landscape was intended as an exploratory comparative tool, we acknowledge that this leaves the spectrum and geometry-shift claims vulnerable to the sampling-variability concern. In the revision we will add permutation tests for assessing separation of model clusters in the PCA space, bootstrap confidence intervals around the Jaccard and inverse Simpson values, and appropriate ANOVA or non-parametric tests for differences across models and instruction types. These quantitative tests will be reported alongside the existing visualizations. revision: yes
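The tests committed to here are standard nonparametric procedures; a minimal sketch, with helper names and input values invented for illustration rather than taken from the paper:

```python
import random

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sample permutation test on the absolute difference of means:
    the p-value is the fraction of label shufflings producing a gap at
    least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot)]

# Invented per-replication Jaccard values for two hypothetical models.
rigid = [0.90, 0.88, 0.92, 0.91]
exploratory = [0.40, 0.45, 0.42, 0.38]
p_value = permutation_test(rigid, exploratory)
ci_low, ci_high = bootstrap_ci(rigid)
```

The same permutation machinery extends to cluster separation in the PCA space by replacing the difference of means with a between- versus within-cluster distance statistic.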
- Referee [Discussion]: No robustness checks are described for sensitivity to prompt paraphrases, alternative option sets, or lexical biases in the narrative constraints. Since the skeptical concern is that Jaccard and inverse Simpson may be dominated by surface features of the elicitation rather than by intrinsic model dispositions, the absence of such tests directly undermines the generalizability of the spectrum and topology-shift findings.
Authors: We accept that the absence of explicit robustness checks limits the strength of the generalizability claims. The original manuscript did not include sensitivity analyses. In the revised version we will add a new 'Robustness and Sensitivity Analyses' subsection that reports: (i) results obtained with paraphrased versions of the core prompts on a representative subset of models, (ii) experiments using both expanded and contracted narrative option sets, and (iii) an examination of potential lexical biases by comparing selections across semantically matched but lexically varied constraints. These additional experiments will be used to qualify the stability of the reported spectrum and geometry shifts. revision: yes
Circularity Check
No circularity: standard metrics and post-hoc PCA applied directly to task outputs
Full rationale
The paper operationalizes consistency as Jaccard similarity across replications and diversity as inverse Simpson index over selected options, then applies PCA to the resulting profiles for visualization. These steps use established, non-parametric formulas computed from the raw selection data without any fitting, parameter estimation, or renaming that reduces the claimed spectrum or geometry shifts to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes, and the derivation chain remains self-contained against external benchmarks. The observed rigidity-exploration spectrum follows from the computed values rather than being presupposed.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption Jaccard similarity on selection sets appropriately quantifies consistency across replications
- domain assumption Inverse Simpson index on option dispersion appropriately quantifies diversity
- domain assumption PCA on model selection profiles yields interpretable axes corresponding to rigidity-exploration
invented entities (1)
- Narrative Landscape (no independent evidence)