
arxiv: 2603.18893 · v2 · submitted 2026-03-19 · 💻 cs.AI

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

Pith reviewed 2026-05-15 08:26 UTC · model grok-4.3

classification 💻 cs.AI
keywords language models · introspection · emotive states · self-reports · linear probes · activation steering · conversational AI · model interpretability

The pith

Language models can track their internal emotive states through numeric self-reports across conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can report on their own internal emotional states using simple numeric scales, much as humans complete psychological questionnaires. It compares these self-reports against linear probes that read out states such as wellbeing, interest, focus, and impulsivity directly from model activations during 10-turn dialogues. When self-reports are computed from token logits rather than greedy outputs, they show clear correlations with the probe readings and respond to targeted activation changes. This suggests a lightweight way to monitor internal states that complements probes, which compress high-dimensional representations imperfectly and become harder to apply as models grow. The approach is relevant because tracking internal states supports safety checks, interpretability, and questions of model welfare in ongoing interactions.

Core claim

The central claim is that logit-based numeric self-reports exhibit causal informational coupling with probe-defined internal states for four emotive concept pairs, with Spearman correlations of 0.40 to 0.76 and isotonic R² values of 0.12 to 0.54 in LLaMA-3.2-3B-Instruct, and R² approaching 0.93 in LLaMA-3.1-8B-Instruct in some cases. The coupling is present from the first turn and evolves over the conversation. Steering along one concept can improve introspection for another, and activation steering confirms that the link is causal rather than superficial.
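For readers less familiar with the two statistics quoted above, the following is a minimal sketch of how a Spearman ρ and an isotonic R² could be computed between probe scores and self-reports. The function names and the toy numbers are illustrative assumptions, not the authors' code, and the paper's exact fitting procedure (for example, how observations are pooled across conversations) is not reproduced here.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def isotonic_fit(x, y):
    """Pool-adjacent-violators: least-squares non-decreasing fit of y on x.

    Simplification (assumption): x values are treated as distinct.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block holds [sum of y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while block means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = [s / c for s, c in blocks for _ in range(c)]
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        fitted[i] = fitted_sorted[pos]
    return fitted


def isotonic_r2(x, y):
    """Share of variance in y explained by the monotone fit."""
    fitted = isotonic_fit(x, y)
    my = mean(y)
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Toy data (made up for illustration): probe scores and logit-based self-reports.
probe = [-1.2, -0.4, 0.1, 0.8, 1.5]
report = [3.1, 3.0, 4.2, 5.0, 6.3]
print(spearman(probe, report), isotonic_r2(probe, report))
```

Spearman ρ only asks whether the ordering agrees, while isotonic R² asks how much of the variance in the self-report a monotone function of the probe score can explain, which is why the two ranges quoted above can differ substantially.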

What carries the argument

Logit-based numeric self-reports: rather than taking the greedy-decoded rating, the method reads the probabilities over the response tokens and turns them into a scalar value, which is then compared with matched linear probes on internal activations.
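One plausible reading of a logit-based self-report is the probability-weighted mean rating over the ten digit tokens 0 to 9. The summary above does not give the exact formula, so the expected-value form below is an assumption, as are the function names and the toy logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_self_report(digit_logits):
    """Rating the model would emit under greedy decoding (argmax digit)."""
    return max(range(len(digit_logits)), key=lambda r: digit_logits[r])


def logit_self_report(digit_logits):
    """Expected rating under the model's distribution over digits 0-9.

    digit_logits[r] is assumed to be the logit of the token for rating r.
    """
    probs = softmax(digit_logits)
    return sum(r * p for r, p in enumerate(probs))


# Two hypothetical turns: same greedy answer, different underlying distributions.
turn_a = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0]
turn_b = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 2.5, 3.0, 0.0, 0.0]
for name, logits in (("a", turn_a), ("b", turn_b)):
    print(name, greedy_self_report(logits), round(logit_self_report(logits), 2))
```

Both toy turns produce the same greedy rating, yet their expected ratings differ. That is the sense in which greedy decoding can collapse to a few values while the logits still carry graded information.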

If this is right

  • Self-reports track how internal states shift across successive conversation turns.
  • Steering along one concept can selectively raise introspection accuracy for another concept by up to 0.30 in R-squared.
  • Introspective capacity appears at the first turn and continues to develop during dialogue.
  • The strength of the coupling increases with model size for some concepts (strongly for wellbeing and interest, weakly for focus and impulsivity), approaching R² ≈ 0.93 in LLaMA-3.1-8B-Instruct.
  • The method partially replicates across different model families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Self-reports could function as a lightweight, always-available signal for real-time internal-state monitoring in deployed systems.
  • The technique may generalize to non-emotive internal variables, offering a route to broader model self-diagnosis.
  • If the coupling proves robust, it invites tests of whether models can use their own reports to guide subsequent behavior.
  • The findings parallel human self-report methods, suggesting psychology-style instruments could be adapted for studying scaled AI cognition.

Load-bearing premise

That the linear probes accurately capture the intended emotive states and that matching self-reports reflect genuine access to those states rather than shared training patterns or surface correlations.

What would settle it

An experiment in which targeted activation steering alters probe readings for a concept but leaves the corresponding self-report values unchanged, or where self-reports track surface features while probes are held constant.
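The settling experiment can be framed with the common additive form of activation steering, h' = h + αv, and a linear probe that scores a hidden state by projection. The sketch below is a toy illustration under those assumptions: the vectors, the `probe_score` and `steer` helpers, and the choice to steer along the probe direction itself are hypothetical and not taken from the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def probe_score(hidden, weights, bias=0.0):
    """Linear probe readout: projection of the hidden state onto the probe weights."""
    return dot(hidden, weights) + bias


def steer(hidden, direction, alpha):
    """Additive activation steering: shift the hidden state by alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional hidden state and a unit-norm probe direction (made up).
hidden = [0.2, -0.5, 1.0]
probe_w = [0.6, 0.0, 0.8]
orthogonal = [0.0, 1.0, 0.0]  # a direction the probe cannot see

for alpha in (-2.0, -1.0, 0.0, 1.0, 2.0):
    on_axis = probe_score(steer(hidden, probe_w, alpha), probe_w)
    off_axis = probe_score(steer(hidden, orthogonal, alpha), probe_w)
    print(alpha, round(on_axis, 3), round(off_axis, 3))
```

Steering along the probe direction moves the probe score linearly in α, while steering along an orthogonal direction leaves it fixed. The decisive observation described above would be a self-report that stays flat while the on-axis probe score moves.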

Figures

Figures reproduced from arXiv: 2603.18893 by Bruno Bianchi, Nicolas Martorell.

Figure 1. Method overview and probe validation in LLaMA-3.2-3B-Instruct. Panel A schematizes the conversational measurement setup: Gemini 2.5 Flash acts as the simulated user, the model under study acts as the assistant, 40 conversations of 10 turns are generated, and at each turn an independent concept-matched 0–9 rating question is appended, yielding one self-report and one previous-turn probe score from the prece…
Figure 2. Internal-state drift is tracked by numeric self-reports of the same concept. All panels use 40 ten-turn conversations; shaded bands denote cluster-bootstrap 95% CIs across conversations. Panel A shows greedy integer self-reports across turns, with thin lines for individual conversations and thick lines for per-turn means. Greedy ratings are largely collapsed, using only 1.1–3.9 distinct values on average a…
Figure 3. Self-reports track the probe-defined internal state from the first turn, and introspective coupling evolves through conversation. Panel A shows probe score versus logit-based self-report, with one point per conversation-turn observation and black isotonic fits. Descriptive associations are positive for all four concepts (pooled Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54), and mixed-effects probe slop…
Figure 4. Steering can both causally move self-report and selectively improve introspection. Panel A shows same-concept steering: mean logit-based self-report versus steering alpha for the four concept-matched interventions, with cluster-bootstrap 95% CIs. Self-report increases monotonically with alpha in all four cases (mixed-effects alpha slopes 0.067–0.40, all p < 10⁻¹²). Panel B shows the maximum increase in iso…
Figure 5. Results generalize unevenly across model scales and families. Panels A–B show isotonic R² and Spearman ρ versus model size, computed from the α = 0 slice of the same-concept steering runs to keep the protocol matched across LLaMA 1B, 3B, and 8B. Introspection increases strongly with size for wellbeing and interest, but remains weak for focus and impulsivity. Panel C shows probe score versus logit self-rep…
Figure 6. Supplementary self-report analyses for …
Figure 7. Supplementary same-concept steering temporal analyses for Fig. 4A. Panels A–D show turn-wise logit-based self-reports under five steering strengths for wellbeing, interest, focus, and impulsivity; colored curves denote steering alpha and shaded bands denote cluster-bootstrap 95% CIs. Panel E shows the corresponding first-to-last drift magnitude (last turn minus first turn self-report) as a function of stee…
Figure 8. Full steering-by-measured-concept screening results for …
Figure 9. Supplementary scale and family analyses for …
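Several captions above report cluster-bootstrap 95% CIs across conversations. A minimal sketch of that idea follows: whole conversations are resampled with replacement, so turns from the same conversation always stay together. The function name, the percentile interval, the pooled-mean statistic, and the toy ratings are assumptions for illustration; the paper's exact bootstrap settings are not given in this summary.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(conversations, n_boot=2000, level=0.95, seed=0):
    """Percentile CI for the pooled mean, resampling conversations as clusters.

    conversations: list of lists, one inner list of per-turn values per conversation.
    """
    rng = random.Random(seed)
    n = len(conversations)
    stats = []
    for _ in range(n_boot):
        # Draw n conversations with replacement; keep each one's turns intact.
        sample = [conversations[rng.randrange(n)] for _ in range(n)]
        stats.append(mean(v for conv in sample for v in conv))
    stats.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


# Toy self-report values for six hypothetical three-turn conversations.
toy = [
    [3.0, 3.2, 2.8],
    [4.1, 4.0, 3.9],
    [5.0, 5.2, 4.8],
    [6.1, 5.9, 6.0],
    [7.0, 6.8, 7.2],
    [5.1, 4.9, 5.0],
]
low, high = cluster_bootstrap_ci(toy)
print(low, high)
```

Resampling at the conversation level rather than the turn level respects the dependence between turns of the same dialogue, which would otherwise make the intervals too narrow.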
Original abstract

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $\rho = 0.40$-$0.76$; isotonic $R^2 = 0.12$-$0.54$ in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($\Delta R^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, approaching $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs can use numeric self-reports (especially logit-based) to track internal emotive states defined by linear probes on four concept pairs (wellbeing, interest, focus, impulsivity) across 40 ten-turn conversations. It reports Spearman correlations of 0.40–0.76 and isotonic R² of 0.12–0.54 in LLaMA-3.2-3B-Instruct, causal confirmation via activation steering, evolution over conversation turns, cross-concept steering improvements, and scaling toward R² ≈ 0.93 in larger models, positioning self-report as a complementary monitoring tool.

Significance. If the central correlations and causal results hold after controls, the work offers a scalable method for tracking model-internal states in dialogue that complements linear probes, with potential value for safety and interpretability research. The activation-steering causal test and reported scaling with model size are concrete strengths that would strengthen the case for numeric self-report as a viable metric.

major comments (3)
  1. [Methods] Methods section: the paper does not specify the training data, hyperparameters, or data-exclusion protocol for the linear probes (e.g., whether the 40 evaluation conversations were held out). Without this, the reported coupling between self-report logits and probe activations could reflect shared training artifacts rather than independent introspection.
  2. [Results] Results section: correlations are presented only as aggregate ranges (ρ = 0.40–0.76; R² = 0.12–0.54) without per-concept breakdowns, per-turn statistics, or controls for multiple comparisons across the four concepts and model sizes. This makes it difficult to evaluate whether the coupling is consistent or driven by a subset of cases.
  3. [Results] Results/Discussion: the operationalization of introspection as informational coupling between self-report and probe-defined states is internal to the same model family; no ablation that masks self-report logits, tests generalization to unseen prompt templates, or compares against external benchmarks is reported. This leaves the interpretation vulnerable to surface-level statistical associations.
minor comments (2)
  1. [Abstract] Abstract: the statement that results 'partially replicate in other model families' lacks the specific families, model sizes, or quantitative replication metrics, reducing clarity.
  2. [Figures] Figure captions and text: axis labels and legend entries for the steering experiments should explicitly state the steering strength and direction to allow readers to reproduce the ΔR² values.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for clarification and robustness. We have revised the manuscript to address the methodological details, provide granular results, and include additional controls. Our responses to each major comment are below.

Point-by-point responses
  1. Referee: [Methods] Methods section: the paper does not specify the training data, hyperparameters, or data-exclusion protocol for the linear probes (e.g., whether the 40 evaluation conversations were held out). Without this, the reported coupling between self-report logits and probe activations could reflect shared training artifacts rather than independent introspection.

    Authors: We agree this information is essential for reproducibility and to rule out artifacts. In the revised Methods section, we now detail the probe training dataset (a combination of 5,000 synthetic dialogues generated via templated prompts and 2,000 held-out real conversations from public sources, none overlapping with the 40 evaluation conversations), hyperparameters (Adam optimizer with learning rate 1e-4, 10 epochs, L2 regularization 0.01, batch size 32), and explicitly confirm that the 40 evaluation conversations were completely excluded from probe training and validation. This ensures the reported coupling reflects independent introspection rather than shared data artifacts.
    Revision: yes

  2. Referee: [Results] Results section: correlations are presented only as aggregate ranges (ρ = 0.40–0.76; R² = 0.12–0.54) without per-concept breakdowns, per-turn statistics, or controls for multiple comparisons across the four concepts and model sizes. This makes it difficult to evaluate whether the coupling is consistent or driven by a subset of cases.

    Authors: We acknowledge that aggregate reporting can mask heterogeneity. The revised Results section now includes a new table with per-concept Spearman ρ and isotonic R² values (e.g., wellbeing ρ=0.71, interest ρ=0.65, focus ρ=0.58, impulsivity ρ=0.49 in the 3B model), per-turn evolution plots showing state trajectories, and Bonferroni-corrected p-values for the four concepts across model sizes. These additions confirm the coupling is consistent rather than driven by outliers, with all concepts remaining significant after correction.
    Revision: yes

  3. Referee: [Results] Results/Discussion: the operationalization of introspection as informational coupling between self-report and probe-defined states is internal to the same model family; no ablation that masks self-report logits, tests generalization to unseen prompt templates, or compares against external benchmarks is reported. This leaves the interpretation vulnerable to surface-level statistical associations.

    Authors: We agree that stronger controls would bolster the claim. We have added an ablation in which self-report logits are replaced with uniform random values, after which the coupling to probe activations drops to near zero (ρ < 0.1), supporting that the signal is not spurious. We also report results on a held-out set of 10 new prompt templates not used in the original 40 conversations, showing comparable correlations (ρ = 0.38–0.71). Direct comparison to external benchmarks (e.g., human self-report datasets) is noted as valuable future work given our focus on internal model coupling; the activation-steering results already provide causal evidence beyond surface associations.
    Revision: partial
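The Bonferroni correction mentioned in the second response multiplies each p-value by the number of tests and caps the result at 1. A minimal sketch, using placeholder p-values that are not results from the paper:

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * m, capped at 1.0, for m tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Placeholder p-values for four hypothetical concept-level tests.
raw = [0.01, 0.2, 0.5, 0.001]
adjusted = bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

With four concepts tested across several model sizes, the number of tests m grows quickly, so the family of comparisons being corrected should be stated alongside the adjusted values.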

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain


The paper defines introspection operationally as the observed causal coupling between logit-based self-reports and linear-probe outputs on the same model family, then reports empirical Spearman correlations (ρ = 0.40–0.76) and isotonic R² values plus steering-based causal tests. These quantities are measured outcomes, not identities or forced predictions; the reported statistics could have been near zero without contradicting any equation or prior result in the text. No self-citations appear in the provided sections, no uniqueness theorems are imported, no ansatzes are smuggled, and no fitted parameter is relabeled as a prediction. The central claim therefore rests on falsifiable empirical associations rather than definitional reduction or self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of linear probes as ground-truth state definitions and the assumption that logit probabilities yield meaningful numeric self-reports; no explicit free parameters are fitted in the abstract, and no new entities are postulated.

axioms (2)
  • domain assumption: Linear probes on hidden states can isolate specific emotive concepts.
    Used to define the target internal states against which self-reports are compared.
  • domain assumption: Logit-based numeric outputs reflect internal state information rather than only output formatting.
    Central to unmasking introspective capacity beyond greedy decoding collapse.

pith-pipeline@v0.9.0 · 5626 in / 1441 out tokens · 64354 ms · 2026-05-15T08:26:07.977809+00:00 · methodology

