Stable Personas: Dual-Assessment of Temporal Stability in LLM-Based Human Simulation
Pith reviewed 2026-05-21 14:53 UTC · model grok-4.3
The pith
LLM personas produce stable self-reports across conversations but show declining observer-rated expression in longer dialogues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Persona-instructed LLMs produce stable, persona-aligned self-reports both between conversations and within extended ones, yet observer ratings show a tendency for persona expressions to decline during longer interactions, establishing this regression as a boundary condition for multi-agent social simulation.
What carries the argument
Dual-assessment framework that measures self-reported characteristics separately from observer-rated persona expression in LLM outputs.
Load-bearing premise
Observer ratings accurately and independently capture the LLM's persona expression without systematic bias from the rating task or output artifacts.
What would settle it
Re-running the observer rating task with new raters, blinded conditions, or an alternative scale that finds no decline over conversation turns would indicate the drop stems from rating artifacts rather than true persona regression.
Original abstract
Large Language Models (LLMs) acting as artificial agents offer the potential for scalable behavioral research, yet their validity depends on whether LLMs can maintain stable personas across extended conversations. We address this point using a dual-assessment framework measuring both self-reported characteristics and observer-rated persona expression. Across two experiments testing four persona conditions (default, high, moderate, and low ADHD presentations), seven LLMs, and three semantically equivalent persona prompts, we examine between-conversation stability (3,473 conversations) and within-conversation stability (1,370 conversations and 18 turns). Self-reports remain highly stable both between and within conversations. However, observer ratings reveal a tendency for persona expressions to decline during extended conversations. These findings suggest that persona-instructed LLMs produce stable, persona-aligned self-reports, an important prerequisite for behavioral research, while identifying this regression tendency as a boundary condition for multi-agent social simulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that LLM-based personas maintain high temporal stability in self-reported characteristics both between conversations (3,473 dialogues) and within conversations (1,370 dialogues across 18 turns), across seven LLMs and four persona conditions (default, high, moderate, and low ADHD presentations) using three semantically equivalent prompts. A dual-assessment framework shows stable self-reports but a decline in observer-rated persona expression during extended interactions, positioning the latter as a boundary condition for multi-agent social simulation while supporting use for self-report behavioral research.
Significance. If the observer ratings are shown to be independent of rater artifacts, the work supplies large-scale empirical evidence on persona stability that directly informs the validity of LLM agents for scalable behavioral research. The sample sizes and explicit condition counts provide a solid empirical base for the self-report stability result, which is a prerequisite for many simulation applications.
major comments (2)
- [Methods] Observer rating procedure (Methods): The manuscript provides no inter-rater reliability statistics and no validation details for the observer prompts. This is load-bearing for the central claim that declining observer scores demonstrate genuine persona regression rather than rater LLM context-length or recency effects, especially since the rating task supplies full conversation history.
- [Results] Within-conversation results: The reported decline in observer scores is interpreted as a boundary condition, yet no ablation is described that varies rater prompt length, uses human raters on a subsample, or tests rating prompts that summarize rather than supply full history. Without such controls the decline cannot be confidently attributed to the target persona.
minor comments (1)
- [Abstract] Abstract: The phrase 'observer ratings reveal a tendency' is vague; a quantitative description of the magnitude or statistical test for the decline would improve clarity.
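One way to supply the quantitative description this minor comment asks for is a per-turn slope of the observer ratings. The following is a minimal sketch, not the authors' analysis (the paper mentions linear mixed models for variance decomposition). It fits an ordinary least-squares line to each conversation's observer ratings and averages the slopes. The function names `ols_slope` and `mean_slope` and the toy ratings are hypothetical.

```python
def ols_slope(turns, ratings):
    """Least-squares slope of observer rating regressed on turn index."""
    n = len(turns)
    if n != len(ratings) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_t = sum(turns) / n
    mean_r = sum(ratings) / n
    # Sum of squared deviations of the predictor (turn index).
    sxx = sum((t - mean_t) ** 2 for t in turns)
    if sxx == 0:
        raise ValueError("turn indices must not all be identical")
    # Sum of cross-products between turn index and rating.
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(turns, ratings))
    return sxy / sxx


def mean_slope(conversations):
    """Average the per-conversation slopes; a negative value means decline."""
    if not conversations:
        raise ValueError("need at least one conversation")
    slopes = [ols_slope(turns, ratings) for turns, ratings in conversations]
    return sum(slopes) / len(slopes)


# Hypothetical toy data: (turn indices, observer ratings) per conversation.
toy_conversations = [
    ([1, 2, 3, 4], [5, 4, 3, 2]),  # rating drops one point per turn
    ([1, 2, 3, 4], [4, 4, 4, 4]),  # rating stays flat
]
average_change_per_turn = mean_slope(toy_conversations)
```

Reporting the average slope with a confidence interval or test statistic would replace "a tendency" with a magnitude. A mixed model with a random slope per conversation would be the more complete treatment.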
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the robustness of our dual-assessment approach. Below we provide point-by-point responses to the major comments and indicate planned revisions to strengthen the manuscript.
- Referee: [Methods] Observer rating procedure (Methods): The manuscript provides no inter-rater reliability statistics and no validation details for the observer prompts. This is load-bearing for the central claim that declining observer scores demonstrate genuine persona regression rather than rater LLM context-length or recency effects, especially since the rating task supplies full conversation history.
  Authors: We agree that explicit inter-rater reliability and prompt validation details are necessary to support interpretation of the observer ratings. In the revised version we will add inter-rater reliability statistics obtained by having two independent LLM raters score a stratified random subsample of 200 conversations, reporting agreement via Cohen’s kappa and percentage agreement. We will also expand the Methods section with a description of the iterative prompt-validation process used to ensure the observer instructions prioritize persona expression over recency or context-length cues. These additions directly address the concern that the observed decline may reflect rater artifacts.
  revision: yes
- Referee: [Results] Within-conversation results: The reported decline in observer scores is interpreted as a boundary condition, yet no ablation is described that varies rater prompt length, uses human raters on a subsample, or tests rating prompts that summarize rather than supply full history. Without such controls the decline cannot be confidently attributed to the target persona.
  Authors: We concur that additional controls would increase confidence in attributing the decline to persona regression. We will include a new ablation in the revision that re-rates a subsample of conversations under both full-history and condensed-summary prompt conditions and compares the resulting trajectories. We will also report human ratings on a small random subsample (approximately 50 conversations) to provide an external validity check on the LLM observer scores. While full-scale human rating of the entire corpus remains impractical, these targeted controls should help isolate the contribution of the target persona from rater-specific effects.
  revision: partial
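The agreement statistics promised in the first response (Cohen's kappa and percentage agreement between two raters) can be illustrated with a short sketch. It assumes each rater assigns one categorical label per conversation. The label values and function names are hypothetical, and nothing here reflects the authors' actual rating scale or pipeline.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length label lists")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal label proportions.
    labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels from two LLM raters on four conversations.
labels_rater_1 = ["high", "high", "low", "low"]
labels_rater_2 = ["high", "low", "low", "low"]
agreement = percent_agreement(labels_rater_1, labels_rater_2)
kappa = cohens_kappa(labels_rater_1, labels_rater_2)
```

If the observer scale is ordinal rather than nominal, a weighted kappa or an intraclass correlation may be a better fit. The unweighted form above treats every disagreement as equally severe.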
Circularity Check
No significant circularity in empirical measurement study
The paper conducts a purely empirical investigation of LLM persona stability via direct experimentation, collecting self-report and observer-rating data across thousands of conversations without any equations, derivations, fitted parameters, or first-principles claims that reduce to inputs by construction. Stability conclusions emerge from observed patterns in the collected data rather than from re-expression of prior definitions or self-citation chains. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the measurement framework; the dual-assessment approach is self-contained against external benchmarks of conversation logs and ratings.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Self-report and observer ratings are valid and complementary measures of persona trait expression in artificial agents.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Self-reports remain highly stable both between and within conversations. However, observer ratings reveal a tendency for persona expressions to decline during extended conversations.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat ≃ Nat recovery
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Variance decomposition via linear mixed models quantifies the contribution of each experimental factor
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- LLM-Based Educational Simulation: Evaluating Temporal Student Persona Stability Across ADHD Profiles
  LLM-simulated ADHD student personas show stable self-reported traits but behavioral drift in unscripted interactions that explicit task prompts fully eliminate.