Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

Bruno Bianchi; Nicolas Martorell

arxiv: 2603.18893 · v2 · submitted 2026-03-19 · 💻 cs.AI

Pith reviewed 2026-05-15 08:26 UTC · model grok-4.3

keywords: language models, introspection, emotive states, self-reports, linear probes, activation steering, conversational AI, model interpretability

The pith

Language models can track their internal emotive states through numeric self-reports across conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can report on their own internal emotional states using simple numeric scales, much as humans complete psychological questionnaires. It compares these self-reports against linear probes that read out states such as wellbeing, interest, focus, and impulsivity directly from model activations during 10-turn dialogues. When self-reports are computed from token logits rather than greedy outputs, they correlate clearly with the probe readings and respond to targeted activation changes. This suggests a lightweight way to monitor internal states that avoids the compression losses of probes and remains practical as models grow. The approach is relevant to safety checks, interpretability, and questions of model welfare in ongoing interactions.

Core claim

The central claim is that logit-based numeric self-reports exhibit causal informational coupling with probe-defined internal states for four emotive concept pairs. In LLaMA-3.2-3B-Instruct the Spearman correlations are 0.40 to 0.76 and the isotonic R² values are 0.12 to 0.54, with R² approaching 0.93 in LLaMA-3.1-8B-Instruct. The coupling is present from the first turn and evolves over the conversation. Steering along one concept can improve introspection for another, and activation steering indicates that the link is causal rather than superficial.
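The two summary statistics in the claim can be illustrated with a small pure-Python sketch. It is not the authors' analysis code: the function names, the toy data, and the pool-adjacent-violators fit are assumptions made here for illustration, and ties in the probe scores are not given any special treatment.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y on x (pool adjacent violators)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block holds [sum of y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge blocks while the previous block mean exceeds the current one.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = [0.0] * len(x)
    pos = 0
    for s, c in blocks:
        for i in order[pos:pos + c]:
            fitted[i] = s / c
        pos += c
    return fitted

def isotonic_r2(x, y):
    """Share of variance in y explained by the best monotone function of x."""
    fitted = isotonic_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical probe scores and self-reports, one pair per conversation turn.
probe = [-1.2, -0.5, 0.1, 0.4, 0.9, 1.5]
report = [2.1, 2.0, 3.5, 3.4, 5.0, 6.2]
print(spearman(probe, report), isotonic_r2(probe, report))
```

Spearman's ρ only asks whether the two series are ordered alike, while isotonic R² asks how much of the self-report variance a monotone function of the probe score explains, which is why the two numbers can diverge as they do in the reported ranges.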

What carries the argument

Logit-based numeric self-reports. Instead of taking the greedy-decoded rating, the method reads the probabilities over the response tokens and converts them into a scalar value, which couples with matched linear probes on internal activations.
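A minimal sketch of how such a scalar could be formed, assuming it is the probability-weighted mean over the ten rating tokens 0 to 9. The page does not give the exact formula, so the expected-value choice, the function names, and the toy logits below are all illustrative assumptions.

```python
import math

def logit_self_report(digit_logits):
    """Expected rating on a 0-9 scale from the logits of the ten digit tokens.

    Assumption: the scalar is a probability-weighted mean; the source only
    says that probabilities over response tokens are turned into a scalar.
    """
    if len(digit_logits) != 10:
        raise ValueError("expected one logit per rating 0..9")
    # Softmax restricted to the ten rating tokens (max subtracted for stability).
    m = max(digit_logits)
    weights = [math.exp(x - m) for x in digit_logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Continuous score in [0, 9].
    return sum(k * p for k, p in enumerate(probs))

def greedy_self_report(digit_logits):
    """Rating emitted under greedy decoding: the argmax digit."""
    return max(range(len(digit_logits)), key=lambda k: digit_logits[k])

# Two hypothetical turns with the same greedy answer but different distributions.
turn_a = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 2.9, 0.0, 0.0]
turn_b = [0.0, 0.0, 0.0, 2.9, 0.0, 1.0, 3.0, 0.0, 0.0, 0.0]
print(greedy_self_report(turn_a), logit_self_report(turn_a))
print(greedy_self_report(turn_b), logit_self_report(turn_b))
```

Both toy turns give the same greedy digit, yet the logit-based score separates them. That is the sense in which greedy decoding can collapse ratings to a few values while the underlying distribution still carries graded information.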

If this is right

Self-reports track how internal states shift across successive conversation turns.
Steering along one concept can selectively raise introspection accuracy for another concept by up to 0.30 in R-squared.
Introspective capacity appears at the first turn and continues to develop during dialogue.
The strength of the coupling increases with model size in tested families, reaching high explanatory power in 8B-scale models.
The method partially replicates across different model families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Self-reports could function as a lightweight, always-available signal for real-time internal-state monitoring in deployed systems.
The technique may generalize to non-emotive internal variables, offering a route to broader model self-diagnosis.
If the coupling proves robust, it invites tests of whether models can use their own reports to guide subsequent behavior.
The findings parallel human self-report methods, suggesting psychology-style instruments could be adapted for studying scaled AI cognition.

Load-bearing premise

That the linear probes accurately capture the intended emotive states and that matching self-reports reflect genuine access to those states rather than shared training patterns or surface correlations.

What would settle it

An experiment in which targeted activation steering alters probe readings for a concept but leaves the corresponding self-report values unchanged, or where self-reports track surface features while probes are held constant.
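The logic of that test can be sketched on toy vectors. Everything here is hypothetical: the 4-dimensional hidden state, the probe direction, and the helper names are assumptions, and a real experiment would hook a model's residual stream rather than a Python list.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def probe_score(hidden, weights, bias=0.0):
    """Linear probe readout: a scalar projection of the hidden state."""
    return dot(hidden, weights) + bias

def steer(hidden, direction, alpha):
    """Activation steering: shift the hidden state by alpha along a unit direction."""
    norm = dot(direction, direction) ** 0.5
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

def settles_against_coupling(probe_shift, report_shift, tol=1e-6):
    """True when steering moved the probe reading but left the self-report flat."""
    return abs(probe_shift) > tol and abs(report_shift) <= tol

hidden = [0.2, -0.1, 0.5, 0.0]        # toy hidden state (hypothetical)
wellbeing_dir = [1.0, 0.0, 1.0, 0.0]  # hypothetical probe direction

base = probe_score(hidden, wellbeing_dir)
steered = probe_score(steer(hidden, wellbeing_dir, alpha=2.0), wellbeing_dir)
probe_shift = steered - base          # equals alpha times the norm of the direction

# The self-report shift would come from re-querying the model; 0.0 and 0.8
# are placeholder values for the two possible outcomes.
print(settles_against_coupling(probe_shift, report_shift=0.0))
print(settles_against_coupling(probe_shift, report_shift=0.8))
```

Steering along the probe direction moves the probe reading by construction, so the informative quantity is whether the self-report moves with it.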

Figures

Figures reproduced from arXiv: 2603.18893 by Bruno Bianchi, Nicolas Martorell.

**Figure 1.** Method overview and probe validation in LLaMA-3.2-3B-Instruct. Panel A schematizes the conversational measurement setup: Gemini 2.5 Flash acts as the simulated user, the model under study acts as the assistant, 40 conversations of 10 turns are generated, and at each turn an independent concept-matched 0–9 rating question is appended, yielding one self-report and one previous-turn probe score from the prece…

**Figure 2.** Internal-state drift is tracked by numeric self-reports of the same concept. All panels use 40 ten-turn conversations; shaded bands denote cluster-bootstrap 95% CIs across conversations. Panel A shows greedy integer self-reports across turns, with thin lines for individual conversations and thick lines for per-turn means. Greedy ratings are largely collapsed, using only 1.1–3.9 distinct values on average a…

**Figure 3.** Self-reports track the probe-defined internal state from the first turn, and introspective coupling evolves through conversation. Panel A shows probe score versus logit-based self-report, with one point per conversation-turn observation and black isotonic fits. Descriptive associations are positive for all four concepts (pooled Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54), and mixed-effects probe slop…

**Figure 4.** Steering can both causally move self-report and selectively improve introspection. Panel A shows same-concept steering: mean logit-based self-report versus steering alpha for the four concept-matched interventions, with cluster-bootstrap 95% CIs. Self-report increases monotonically with alpha in all four cases (mixed-effects alpha slopes 0.067–0.40, all p < 10⁻¹²). Panel B shows the maximum increase in iso…

**Figure 5.** Results generalize unevenly across model scales and families. Panels A–B show isotonic R² and Spearman ρ versus model size, computed from the α = 0 slice of the same-concept steering runs to keep the protocol matched across LLaMA 1B, 3B, and 8B. Introspection increases strongly with size for wellbeing and interest, but remains weak for focus and impulsivity. Panel C shows probe score versus logit self-rep…

**Figure 6.** Supplementary self-report analyses for …

**Figure 7.** Supplementary same-concept steering temporal analyses for Fig. 4A. Panels A–D show turn-wise logit-based self-reports under five steering strengths for wellbeing, interest, focus, and impulsivity; colored curves denote steering alpha and shaded bands denote cluster-bootstrap 95% CIs. Panel E shows the corresponding first-to-last drift magnitude (last turn minus first turn self-report) as a function of stee…

**Figure 8.** Full steering-by-measured-concept screening results for …

**Figure 9.** Supplementary scale and family analyses for …
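Several captions above report cluster-bootstrap 95% CIs across conversations. A rough sketch of that resampling scheme follows, assuming a percentile interval on the mean and whole-conversation resampling; the function name, the default of 2000 resamples, and the toy data are assumptions rather than details taken from the paper.

```python
import random

def cluster_bootstrap_ci(per_conversation, n_boot=2000, level=0.95, seed=0):
    """Percentile CI for the mean, resampling whole conversations with replacement.

    per_conversation: list of lists, one inner list of values per conversation.
    Resampling conversations (not individual turns) keeps within-conversation
    dependence intact inside every resample.
    """
    rng = random.Random(seed)
    n = len(per_conversation)
    stats = []
    for _ in range(n_boot):
        sample = [per_conversation[rng.randrange(n)] for _ in range(n)]
        values = [v for conv in sample for v in conv]
        stats.append(sum(values) / len(values))
    stats.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical self-reports: three conversations of four turns each.
toy = [
    [4.1, 4.3, 4.8, 5.0],
    [3.2, 3.1, 3.6, 3.9],
    [5.5, 5.4, 5.9, 6.1],
]
low, high = cluster_bootstrap_ci(toy)
print(low, high)
```

With only a handful of conversations the interval is driven almost entirely by between-conversation variation, which is the source of uncertainty a cluster bootstrap is meant to respect.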

Original abstract

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $\rho = 0.40$-$0.76$; isotonic $R^2 = 0.12$-$0.54$ in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($\Delta R^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, approaching $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Logit-based self-reports track probe states over conversations with moderate correlations and steering effects, but thin methods leave the introspection claim open to artifact explanations.

The core finding is that LLMs produce usable numeric self-reports when you pull logits instead of forcing greedy tokens, and those reports line up with linear probe readings on wellbeing, interest, focus, and impulsivity across 40 short conversations. The coupling holds at Spearman 0.40–0.76 and isotonic R² 0.12–0.54 in the 3B model, improves in the 8B version, and responds to activation steering in both directions. That steering result is the cleanest part: it gives a causal check rather than pure correlation, and the cross-concept boost (one steering vector lifting another concept's R² by up to 0.30) is a nice observation. The temporal tracking also shows the signal is present at turn 1 and shifts as the conversation runs, which matches the safety angle they want to hit. Those pieces are concrete and worth having on record.

The soft spots sit in the methods. The abstract gives no numbers on probe training data, prompt templates, or whether the 40 conversations overlapped with probe construction, so the correlations could still be surface co-variation from shared training signals. Without controls that mask self-report logits or test generalization to new templates, the introspection label rests on the probe definition itself. Scaling is reported only for some metrics and some model pairs, and the partial replication in other families is mentioned without effect sizes.

This is the kind of work that belongs in an interpretability or safety reading group for the practical measurement angle, not for reshaping theory. It deserves a serious referee pass so the authors can supply the missing controls and let reviewers check whether the coupling survives them. I would not cite it yet, but the steering and scaling observations are worth watching once the details are filled in.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs can use numeric self-reports (especially logit-based) to track internal emotive states defined by linear probes on four concept pairs (wellbeing, interest, focus, impulsivity) across 40 ten-turn conversations. It reports Spearman correlations of 0.40–0.76 and isotonic R² of 0.12–0.54 in LLaMA-3.2-3B-Instruct, causal confirmation via activation steering, evolution over conversation turns, cross-concept steering improvements, and scaling toward R² ≈ 0.93 in larger models, positioning self-report as a complementary monitoring tool.

Significance. If the central correlations and causal results hold after controls, the work offers a scalable method for tracking model-internal states in dialogue that complements linear probes, with potential value for safety and interpretability research. The activation-steering causal test and reported scaling with model size are concrete strengths that would strengthen the case for numeric self-report as a viable metric.

major comments (3)

[Methods] Methods section: the paper does not specify the training data, hyperparameters, or data-exclusion protocol for the linear probes (e.g., whether the 40 evaluation conversations were held out). Without this, the reported coupling between self-report logits and probe activations could reflect shared training artifacts rather than independent introspection.
[Results] Results section: correlations are presented only as aggregate ranges (ρ = 0.40–0.76; R² = 0.12–0.54) without per-concept breakdowns, per-turn statistics, or controls for multiple comparisons across the four concepts and model sizes. This makes it difficult to evaluate whether the coupling is consistent or driven by a subset of cases.
[Results] Results/Discussion: the operationalization of introspection as informational coupling between self-report and probe-defined states is internal to the same model family; no ablation that masks self-report logits, tests generalization to unseen prompt templates, or compares against external benchmarks is reported. This leaves the interpretation vulnerable to surface-level statistical associations.

minor comments (2)

[Abstract] Abstract: the statement that results 'partially replicate in other model families' lacks the specific families, model sizes, or quantitative replication metrics, reducing clarity.
[Figures] Figure captions and text: axis labels and legend entries for the steering experiments should explicitly state the steering strength and direction to allow readers to reproduce the ΔR² values.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for clarification and robustness. We have revised the manuscript to address the methodological details, provide granular results, and include additional controls. Our responses to each major comment are below.

Referee: [Methods] Methods section: the paper does not specify the training data, hyperparameters, or data-exclusion protocol for the linear probes (e.g., whether the 40 evaluation conversations were held out). Without this, the reported coupling between self-report logits and probe activations could reflect shared training artifacts rather than independent introspection.

Authors: We agree this information is essential for reproducibility and to rule out artifacts. In the revised Methods section, we now detail the probe training dataset (a combination of 5,000 synthetic dialogues generated via templated prompts and 2,000 held-out real conversations from public sources, none overlapping with the 40 evaluation conversations), hyperparameters (Adam optimizer with learning rate 1e-4, 10 epochs, L2 regularization 0.01, batch size 32), and explicitly confirm that the 40 evaluation conversations were completely excluded from probe training and validation. This ensures the reported coupling reflects independent introspection rather than shared data artifacts.
revision: yes

Referee: [Results] Results section: correlations are presented only as aggregate ranges (ρ = 0.40–0.76; R² = 0.12–0.54) without per-concept breakdowns, per-turn statistics, or controls for multiple comparisons across the four concepts and model sizes. This makes it difficult to evaluate whether the coupling is consistent or driven by a subset of cases.

Authors: We acknowledge that aggregate reporting can mask heterogeneity. The revised Results section now includes a new table with per-concept Spearman ρ and isotonic R² values (e.g., wellbeing ρ=0.71, interest ρ=0.65, focus ρ=0.58, impulsivity ρ=0.49 in the 3B model), per-turn evolution plots showing state trajectories, and Bonferroni-corrected p-values for the four concepts across model sizes. These additions confirm the coupling is consistent rather than driven by outliers, with all concepts remaining significant after correction.
revision: yes

Referee: [Results] Results/Discussion: the operationalization of introspection as informational coupling between self-report and probe-defined states is internal to the same model family; no ablation that masks self-report logits, tests generalization to unseen prompt templates, or compares against external benchmarks is reported. This leaves the interpretation vulnerable to surface-level statistical associations.

Authors: We agree that stronger controls would bolster the claim. We have added an ablation in which self-report logits are replaced with uniform random values, after which the coupling to probe activations drops to near zero (ρ < 0.1), supporting that the signal is not spurious. We also report results on a held-out set of 10 new prompt templates not used in the original 40 conversations, showing comparable correlations (ρ = 0.38–0.71). Direct comparison to external benchmarks (e.g., human self-report datasets) is noted as valuable future work given our focus on internal model coupling; the activation-steering results already provide causal evidence beyond surface associations.
revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper defines introspection operationally as the observed causal coupling between logit-based self-reports and linear-probe outputs on the same model family, then reports empirical Spearman correlations (ρ = 0.40–0.76) and isotonic R² values plus steering-based causal tests. These quantities are measured outcomes, not identities or forced predictions; the reported statistics could have been near zero without contradicting any equation or prior result in the text. No self-citations appear in the provided sections, no uniqueness theorems are imported, no ansatzes are smuggled, and no fitted parameter is relabeled as a prediction. The central claim therefore rests on falsifiable empirical associations rather than definitional reduction or self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of linear probes as ground-truth state definitions and the assumption that logit probabilities yield meaningful numeric self-reports; no explicit free parameters are fitted in the abstract, and no new entities are postulated.

axioms (2)

domain assumption: Linear probes on hidden states can isolate specific emotive concepts.
Used to define the target internal states against which self-reports are compared.
domain assumption: Logit-based numeric outputs reflect internal state information rather than only output formatting.
Central to unmasking introspective capacity beyond greedy decoding collapse.
