arxiv: 2605.09159 · v1 · submitted 2026-05-09 · 💻 cs.AI

Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas

Pith reviewed 2026-05-12 03:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords polylogue · persona vectors · LLM reasoning · activation steering · latent monitoring · interpretability · MMLU-Pro · dynamic intervention

The pith

The time series of persona alignments during generation predicts LLM correctness and enables targeted interventions that raise accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the polylogue as a dynamic record of how different behavioral personas align with a model's hidden states while it generates an answer. It tests whether tracking these alignments over time can forecast whether the final answer on MMLU-Pro, a hard multiple-choice benchmark, will be correct. If the approach holds, it supplies an interpretable handle for watching and adjusting reasoning mid-generation rather than only after the fact. The authors then turn the signals into a simple steering method that chooses its intervention paragraph by paragraph, and they report accuracy gains on three of the four models tested.

Core claim

We define the polylogue as the time series of alignments between persona vectors and hidden activations across the course of generation. On MMLU-Pro, features derived from these time series predict answer correctness at a level competitive with low-dimensional activation baselines while remaining traceable to specific persona directions. A paragraph-conditioned intervention that modulates the relevant directions at identified stages improves accuracy on three of four open-weight models, indicating that stage-aware latent steering can serve as a practical method for reasoning-time control.

What carries the argument

The polylogue, the time series of alignments between fixed persona vectors and evolving hidden activations during generation.
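A minimal sketch of what such a time series could look like in code. The alignment metric and the layer from which hidden states are read are not specified on this page, so cosine similarity, the toy 2-D vectors, and the function names `cosine` and `polylogue` are assumptions made for illustration only. The persona names are borrowed from the figure captions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def polylogue(hidden_states, persona_vectors):
    """Map each persona name to its alignment at every generation step.

    hidden_states: list of vectors, one per generated token (hypothetical input).
    persona_vectors: dict of persona name -> fixed direction.
    """
    return {
        name: [cosine(h, vec) for h in hidden_states]
        for name, vec in persona_vectors.items()
    }


# Toy data: real hidden states would be read from one layer of the model.
persona_vectors = {"explorer": [1.0, 0.0], "arbiter": [0.0, 1.0]}
hidden_states = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
series = polylogue(hidden_states, persona_vectors)
```

In this toy run the hidden state drifts from the `explorer` direction toward the `arbiter` direction, so the two series move in opposite directions over the three steps.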

If this is right

  • Polylogue features supply an interpretable alternative to opaque activation summaries for forecasting correctness on complex benchmarks.
  • Specific persona directions become concrete targets for modulation at particular points in the generation process.
  • Stage-aware steering during response generation can raise accuracy without model retraining.
  • The same monitoring approach can identify which latent directions matter most at early versus late stages of an answer.
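The stage-aware steering in the bullets above can be illustrated with a small sketch. It assumes additive steering (adding a scaled persona direction to a hidden state), which is a common form of activation steering but is not confirmed here as the paper's exact rule. The `schedule` format, the strength value, and the function names are hypothetical; the choice of the arbiter persona at paragraph 4 mirrors the example in the Figure 2 caption.

```python
def steer(hidden, persona_vector, strength):
    """Add a scaled persona direction to one hidden-state vector (assumed rule)."""
    return [h + strength * p for h, p in zip(hidden, persona_vector)]


def paragraph_conditioned_steer(hidden, paragraph_index, schedule, persona_vectors):
    """Apply the steering rule scheduled for the current paragraph, if any.

    schedule: dict of paragraph index -> (persona name, strength); hypothetical format.
    """
    rule = schedule.get(paragraph_index)
    if rule is None:
        return list(hidden)  # no intervention at this stage
    persona, strength = rule
    return steer(hidden, persona_vectors[persona], strength)


persona_vectors = {"arbiter": [0.0, 1.0]}
schedule = {4: ("arbiter", 2.0)}  # illustrative strength, not a value from the paper

unchanged = paragraph_conditioned_steer([0.5, 0.5], 1, schedule, persona_vectors)
steered = paragraph_conditioned_steer([0.5, 0.5], 4, schedule, persona_vectors)
```

Keeping the schedule as plain data separates the question of which direction to modulate at which stage from the mechanics of the intervention itself.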

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to tracking how models switch between reasoning styles on open-ended tasks where no single correct answer exists.
  • If polylogues prove causal, they might guide the construction of training objectives that encourage productive internal persona interactions.
  • Combining polylogue signals with other linear probes could yield finer-grained control over when a model adopts cautious versus creative behavior.

Load-bearing premise

Persona vectors identified in earlier work stay stable and linearly separable from one another as hidden states change dynamically throughout a single generation.

What would settle it

A direct test in which the paragraph-conditioned intervention produces no accuracy gain on MMLU-Pro, or in which polylogue-derived features lose all predictive advantage over random directions, would falsify the claim.
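One way to set up the random-direction comparison is to draw control directions uniformly from the unit sphere and run them through the same feature pipeline as the persona vectors. The sketch below covers only the sampling step. The function name, the seed, and the dimensions are illustrative assumptions, and the page does not say how the paper builds such a control, if it builds one at all.

```python
import math
import random


def random_unit_vectors(count, dim, seed=0):
    """Sample `count` directions uniformly from the unit sphere in `dim` dimensions.

    Normalising a vector of independent standard Gaussians gives a
    uniformly distributed direction.
    """
    rng = random.Random(seed)
    vectors = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


# Four control directions to stand in for four persona vectors (toy dimension).
controls = random_unit_vectors(count=4, dim=8)
```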

Figures

Figures reproduced from arXiv: 2605.09159 by Kirill Bykov, Leander Girrbach, Nils A. Herrmann, Zeynep Akata.

Figure 1: Latent trait alignment over response progress. Softmax-normalised mean similarity $\bar{s}_k(t)$ of four persona traits as a function of normalised response position. To test this perspective, we introduce a mechanistic pipeline for open-weight reasoning models that constructs persona vectors via contrastive activation differences, tracks their stepwise alignment with model activations during CoT, and quantifies m…
Figure 2: Example generation with and without steering using DeepSeek-R1-Distill-Qwen-14B and the arbiter persona vector. Steering towards the arbiter persona at §4 makes the reasoning stop after committing to a solution. After obtaining persona vectors, we can steer activations to induce a certain persona as seen in …
Figure 3: Paragraph label distribution over response progress. Fraction of paragraphs assigned each persona label at each normalised progress bin. The interpreter dominates the opening of the trace and the arbiter the close, while explorer and solver remain consistently engaged throughout the middle. This pattern, distinct personas active at distinct phases of generation, is the kind of latent structure the polylog…
Original abstract

Recent work shows that large language models (LLMs) encode behavioural traits ("personas") as linear directions in activation space, often called "persona vectors". Prior work has used such directions as static handles for behavioural steering. Building on this, we treat them as dynamic signals instead: probes we can monitor and intervene on as reasoning unfolds. We use the term polylogue to denote the time series of alignments between persona vectors and hidden activations over the course of generation. Experiments across four open-weight models show that polylogue features predict correctness on MMLU-Pro competitively with low-dimensional activation baselines, while remaining interpretable through their associated persona directions. They also suggest concrete steering targets, namely which latent directions to modulate at different stages of a response. We instantiate this as a simple paragraph-conditioned intervention that improves accuracy on three of four models, pointing to stage-aware latent steering as a promising direction for reasoning-time control. Together, this positions the polylogue as an interpretable tool for reasoning-time monitoring and intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes treating persona vectors as dynamic signals during generation (termed the 'polylogue', i.e., the time series of alignments with hidden activations) rather than static steering handles. It reports that polylogue-derived features predict correctness on MMLU-Pro competitively with low-dimensional activation baselines across four open-weight models, remain interpretable via their persona directions, and enable a paragraph-conditioned intervention that improves accuracy on three of four models.

Significance. If the empirical results hold under closer scrutiny, the work supplies an interpretable, inference-time tool for monitoring and stage-aware steering of LLM reasoning by repurposing existing persona vectors. The intervention result, in particular, points toward practical latent-space control methods that do not require retraining.

major comments (2)
  1. [Experiments] Experiments section: the central claim of competitive predictive performance and accuracy gains on 3/4 models is presented without reported statistical tests, exact feature-construction details, baseline implementations, or controls for multiple comparisons. This leaves the empirical support for both the prediction and intervention results only moderately grounded.
  2. [Methods] Methods / persona-vector usage: the approach treats vectors extracted in prior static contexts as fixed, linearly separable directions that remain stable and causally relevant when tracked dynamically across token positions and layers on MMLU-Pro. No direct validation (e.g., stability checks or ablation of dynamic vs. static alignment) is supplied; if alignments instead reflect superficial correlations with answer tokens, both the predictive competitiveness and the paragraph-conditioned steering results are at risk.
minor comments (2)
  1. [Abstract] The term 'polylogue' is introduced in the abstract and early sections without an explicit definition or etymological note, which may hinder immediate comprehension for readers.
  2. [Figures] Figure captions and axis labels for the polylogue time-series plots could be expanded to indicate layer indices, token ranges, and exact alignment metric used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your constructive feedback. We have revised the manuscript to strengthen the empirical grounding and add the requested validations. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of competitive predictive performance and accuracy gains on 3/4 models is presented without reported statistical tests, exact feature-construction details, baseline implementations, or controls for multiple comparisons. This leaves the empirical support for both the prediction and intervention results only moderately grounded.

    Authors: We agree that additional statistical rigor and implementation details are required. In the revised manuscript we now report paired t-tests (with exact p-values) comparing polylogue features against activation baselines for each model. We have expanded the feature-construction description to include precise formulas for per-token alignment computation, paragraph-level aggregation (mean, variance, and linear trend), and the exact dimensionality reduction applied. Baseline implementations are described with references to the original probing papers and include the same layer and token-position sampling used for our features. Multiple-comparison correction via Bonferroni is now applied and stated. These additions directly address the moderate grounding concern. revision: yes

  2. Referee: [Methods] Methods / persona-vector usage: the approach treats vectors extracted in prior static contexts as fixed, linearly separable directions that remain stable and causally relevant when tracked dynamically across token positions and layers on MMLU-Pro. No direct validation (e.g., stability checks or ablation of dynamic vs. static alignment) is supplied; if alignments instead reflect superficial correlations with answer tokens, both the predictive competitiveness and the paragraph-conditioned steering results are at risk.

    Authors: We thank the referee for identifying this assumption. The revised Methods section now contains a dedicated validation subsection that reports (i) cosine-similarity stability of each persona direction across token positions and layers on MMLU-Pro and (ii) an ablation that directly compares predictive performance of the full dynamic polylogue time series against its static (mean-alignment) counterpart. While the competitive results against full activation baselines already suggest the features are not reducible to answer-token correlations, we have added an explicit discussion of this risk and note that stronger causal interventions remain future work. The paragraph-level steering results are now presented alongside these new checks. revision: yes
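The paragraph-level aggregation named in the first response (mean, variance, and linear trend) can be sketched as follows. This is an illustration under stated assumptions, not the authors' formulas: population variance and an ordinary least-squares slope against the token index are choices made here, and `paragraph_features` is a hypothetical name.

```python
def paragraph_features(alignments):
    """Summarise one persona's per-token alignments within a paragraph.

    Returns the mean, the population variance, and the least-squares slope
    of alignment against token index (the "linear trend").
    """
    n = len(alignments)
    if n == 0:
        raise ValueError("paragraph has no tokens")
    mean = sum(alignments) / n
    variance = sum((a - mean) ** 2 for a in alignments) / n
    t_mean = (n - 1) / 2  # mean of token indices 0..n-1
    denom = sum((t - t_mean) ** 2 for t in range(n))
    if denom == 0:
        trend = 0.0  # a single token has no trend
    else:
        trend = sum(
            (t - t_mean) * (a - mean) for t, a in enumerate(alignments)
        ) / denom
    return {"mean": mean, "variance": variance, "trend": trend}


# Toy paragraph whose alignment rises steadily by 0.1 per token.
features = paragraph_features([0.1, 0.2, 0.3, 0.4])
```

Dropping the `trend` and `variance` entries leaves only the mean alignment, which corresponds to the static counterpart that the second response proposes as an ablation.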

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external prior methods and new experiments

Full rationale

The paper defines polylogue as the time series of alignments between persona vectors (from prior external work) and hidden activations during generation. Predictive features are extracted from these alignments and evaluated against MMLU-Pro correctness via standard ML baselines and correlations; the paragraph-conditioned intervention is presented as a simple application of steering at identified stages. No equations or steps reduce the reported predictions or accuracy improvements to quantities defined by parameters fitted inside this paper. The central claims rest on empirical measurements rather than self-definition or self-citation chains that would force the outcomes by construction. This is the most common honest finding for papers that apply established techniques to new dynamic settings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that linear persona directions extracted from prior work remain meaningful when tracked dynamically, plus the new conceptual entity 'polylogue' whose utility is demonstrated only within the reported experiments.

axioms (1)
  • [domain assumption] Persona vectors extracted as linear directions in activation space represent distinct behavioral traits
    This is the foundational premise taken from prior work and used to interpret the time-series alignments.
invented entities (1)
  • polylogue [no independent evidence]
    purpose: To denote the time series of alignments between persona vectors and hidden activations over the course of generation
    New term and framing introduced to conceptualize dynamic monitoring; no independent falsifiable handle outside the paper's own experiments.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    Persona vectors: Monitoring and controlling character traits in language models

    Chen, R., Arditi, A., Sleight, H., Evans, O., and Lindsey, J. Persona vectors: Monitoring and controlling character traits in language models. In arXiv, 2025

  2. [2]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. In Nature, 2025

  3. [3]

    OpenAI o1 system card

    Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. OpenAI o1 system card. In arXiv, 2024

  4. [4]

    What makes a good reasoning chain? Uncovering structural patterns in long chain-of-thought reasoning

    Jiang, G., Liu, Y., Li, Z., Bi, W., Zhang, F., Song, L., Wei, Y., and Lian, D. What makes a good reasoning chain? Uncovering structural patterns in long chain-of-thought reasoning. In EMNLP, 2025

  5. [5]

    Let’s verify step by step

    Cobbe, K. Let’s verify step by step. In ICLR, 2024

  6. [6]

    The Persona Selection Model: Why AI Assistants might Behave like Humans

    Marks, S., Lindsey, J., and Olah, C. The Persona Selection Model: Why AI Assistants might Behave like Humans. In Alignment Science Blog, 2026

  7. [7]

    Mathematical problem solving

    Schoenfeld, A. H. Mathematical problem solving. Academic Press, 1985

  8. [8]

    Extracting latent steering vectors from pretrained language models

    Subramani, N., Suresh, N., and Peters, M. E. Extracting latent steering vectors from pretrained language models. In ACL (Findings), 2022

  9. [9]

    Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting

    Turpin, M., Michael, J., Perez, E., and Bowman, S. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In NeurIPS, 2023

  10. [10]

    Solving math word problems with process- and outcome-based feedback

    Wang, L., Creswell, A., Irving, G., and Higgins, I. Solving math word problems with process- and outcome-based feedback. In arXiv, 2022

  11. [11]

    MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

    Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In NeurIPS, 2024

  12. [12]

    Chain-of-thought prompting elicits reasoning in large language models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022

  13. [13]

    Mapping the minds of LLMs: A graph-based analysis of reasoning LLMs

    Xiong, Z., Cai, Y., Li, Z., and Wang, Y. Mapping the minds of LLMs: A graph-based analysis of reasoning LLMs. In EMNLP, 2025

A. Reasoning Personas

Table 5 provides the full mapping from Schoenfeld’s reasoning episodes (Schoenfeld, 1985) to the eight reasoning personas used in our ana...