Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Pith reviewed 2026-05-15 08:26 UTC · model grok-4.3
The pith
Language models can track their internal emotive states through numeric self-reports across conversations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that logit-based numeric self-reports exhibit causal informational coupling with probe-defined internal states for four emotive concept pairs (wellbeing, interest, focus, and impulsivity). In LLaMA-3.2-3B-Instruct the coupling yields Spearman correlations of 0.40 to 0.76 and isotonic R-squared values of 0.12 to 0.54; in some cases it grows with model size, approaching an R-squared of about 0.93 in LLaMA-3.1-8B-Instruct. The coupling is present from the first turn and evolves over the conversation. Steering along one concept can improve introspection for another, and activation steering indicates that the link is causal rather than superficial.
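The two statistics named in the core claim can be sketched in plain Python. This is a minimal illustration, not the paper's code: it assumes the self-report is the predictor and the probe reading is the target, fits a non-decreasing isotonic curve with the pool-adjacent-violators procedure, and uses made-up toy numbers. The function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def isotonic_fit(x, y):
    """Least-squares non-decreasing fit of y on x (pool adjacent violators).

    Simplification: tied x values are not pooled specially.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block holds [sum of y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = [0.0] * len(x)
    pos = 0
    for s, c in blocks:
        for i in order[pos:pos + c]:
            fitted[i] = s / c
        pos += c
    return fitted

def isotonic_r2(x, y):
    """Share of variance in y explained by the isotonic fit on x."""
    fitted = isotonic_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Toy data (invented for illustration only).
self_report = [2.1, 3.4, 3.9, 5.0, 6.2, 7.5]
probe = [-0.8, -0.1, -0.3, 0.4, 0.9, 0.7]
print(spearman(self_report, probe), isotonic_r2(self_report, probe))
```

Spearman only asks whether the two series rise and fall together in rank, while the isotonic R-squared asks how much of the probe's variance a monotone function of the self-report can explain, which is why the two numbers can differ substantially on the same data.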
What carries the argument
Logit-based numeric self-reports: rather than greedy-decoding a single answer, the method reads the probabilities the model assigns to the candidate response tokens and reduces them to a scalar value, which is then compared against a concept-matched linear probe on internal activations.
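A minimal sketch of the idea, under stated assumptions: the model answers on a 0 to 9 scale with one token per digit, and the scalar is the probability-weighted mean rating. The page does not spell out the paper's exact reduction, so `expected_rating` and the toy logits below are illustrative, not the authors' implementation.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_rating(digit_logits):
    """Greedy decoding: keep only the single most likely rating."""
    return max(range(len(digit_logits)), key=lambda i: digit_logits[i])

def expected_rating(digit_logits):
    """Logit-based report: probability-weighted mean over all ratings.

    Assumption: digit_logits[i] is the logit of the token for rating i.
    """
    probs = softmax(digit_logits)
    return sum(i * p for i, p in enumerate(probs))

# Two hypothetical turns with the same greedy answer but different
# probability mass around it.
turn_a = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 0.0]
turn_b = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 3.0, 2.9, 2.0]

print(greedy_rating(turn_a), greedy_rating(turn_b))
print(expected_rating(turn_a), expected_rating(turn_b))
```

Both toy turns decode greedily to the same digit, yet the weighted mean differs because the second turn places more mass on higher ratings. That is the kind of information a greedy answer discards.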
If this is right
- Self-reports track how internal states shift across successive conversation turns.
- Steering along one concept can selectively raise introspection accuracy for another concept by up to 0.30 in R-squared.
- Introspective capacity appears at the first turn and continues to develop during dialogue.
- The strength of the coupling increases with model size in some cases, approaching an R-squared of about 0.93 in LLaMA-3.1-8B-Instruct.
- The method partially replicates across different model families.
Where Pith is reading between the lines
- Self-reports could function as a lightweight, always-available signal for real-time internal-state monitoring in deployed systems.
- The technique may generalize to non-emotive internal variables, offering a route to broader model self-diagnosis.
- If the coupling proves robust, it invites tests of whether models can use their own reports to guide subsequent behavior.
- The findings parallel human self-report methods, suggesting psychology-style instruments could be adapted for studying scaled AI cognition.
Load-bearing premise
That the linear probes accurately capture the intended emotive states and that matching self-reports reflect genuine access to those states rather than shared training patterns or surface correlations.
What would settle it
An experiment in which targeted activation steering alters probe readings for a concept but leaves the corresponding self-report values unchanged, or where self-reports track surface features while probes are held constant.
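The first half of that test can be sketched as a check for dissociation between probe and self-report under steering. This is a toy illustration only: hidden states are plain lists, the probe is a bare dot product, and `report_fn` is a hypothetical stand-in for whatever maps a hidden state to a self-report value in a real model.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def probe_reading(hidden, weights, bias=0.0):
    """Linear probe: a weighted sum of hidden-state coordinates."""
    return dot(hidden, weights) + bias

def steer(hidden, direction, alpha):
    """Activation steering: shift the hidden state along a direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def dissociation(hidden, direction, weights, report_fn, alpha, tol=1e-6):
    """Return True if steering moves the probe but not the self-report."""
    steered = steer(hidden, direction, alpha)
    probe_shift = probe_reading(steered, weights) - probe_reading(hidden, weights)
    report_shift = report_fn(steered) - report_fn(hidden)
    return abs(probe_shift) > tol and abs(report_shift) <= tol

# Toy setup (all values invented): the probe reads coordinate 0.
hidden = [0.2, -0.1, 0.5]
weights = [1.0, 0.0, 0.0]
direction = weights

def coupled_report(h):
    """Hypothetical report that depends on the probed coordinate."""
    return 5 + 2 * h[0]

def decoupled_report(h):
    """Hypothetical report that ignores the probed coordinate."""
    return 5 + 2 * h[2]

print(dissociation(hidden, direction, weights, coupled_report, alpha=1.5))
print(dissociation(hidden, direction, weights, decoupled_report, alpha=1.5))
```

A `True` result for the real model's self-report would count as evidence against the causal-coupling claim; a `False` result is what the paper's account predicts.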
Original abstract
Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $\rho = 0.40$-$0.76$; isotonic $R^2 = 0.12$-$0.54$ in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($\Delta R^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, approaching $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs can use numeric self-reports (especially logit-based) to track internal emotive states defined by linear probes on four concept pairs (wellbeing, interest, focus, impulsivity) across 40 ten-turn conversations. It reports Spearman correlations of 0.40–0.76 and isotonic R² of 0.12–0.54 in LLaMA-3.2-3B-Instruct, causal confirmation via activation steering, evolution over conversation turns, cross-concept steering improvements, and scaling toward R² ≈ 0.93 in larger models, positioning self-report as a complementary monitoring tool.
Significance. If the central correlations and causal results hold after controls, the work offers a scalable method for tracking model-internal states in dialogue that complements linear probes, with potential value for safety and interpretability research. The activation-steering causal test and reported scaling with model size are concrete strengths that would strengthen the case for numeric self-report as a viable metric.
major comments (3)
- [Methods] Methods section: the paper does not specify the training data, hyperparameters, or data-exclusion protocol for the linear probes (e.g., whether the 40 evaluation conversations were held out). Without this, the reported coupling between self-report logits and probe activations could reflect shared training artifacts rather than independent introspection.
- [Results] Results section: correlations are presented only as aggregate ranges (ρ = 0.40–0.76; R² = 0.12–0.54) without per-concept breakdowns, per-turn statistics, or controls for multiple comparisons across the four concepts and model sizes. This makes it difficult to evaluate whether the coupling is consistent or driven by a subset of cases.
- [Results] Results/Discussion: the operationalization of introspection as informational coupling between self-report and probe-defined states is internal to the same model family; no ablation that masks self-report logits, tests generalization to unseen prompt templates, or compares against external benchmarks is reported. This leaves the interpretation vulnerable to surface-level statistical associations.
minor comments (2)
- [Abstract] Abstract: the statement that results 'partially replicate in other model families' lacks the specific families, model sizes, or quantitative replication metrics, reducing clarity.
- [Figures] Figure captions and text: axis labels and legend entries for the steering experiments should explicitly state the steering strength and direction to allow readers to reproduce the ΔR² values.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for clarification and robustness. We have revised the manuscript to address the methodological details, provide granular results, and include additional controls. Our responses to each major comment are below.
Point-by-point responses
- Referee: [Methods] Methods section: the paper does not specify the training data, hyperparameters, or data-exclusion protocol for the linear probes (e.g., whether the 40 evaluation conversations were held out). Without this, the reported coupling between self-report logits and probe activations could reflect shared training artifacts rather than independent introspection.
  Authors: We agree this information is essential for reproducibility and to rule out artifacts. In the revised Methods section, we now detail the probe training dataset (a combination of 5,000 synthetic dialogues generated via templated prompts and 2,000 held-out real conversations from public sources, none overlapping with the 40 evaluation conversations), hyperparameters (Adam optimizer with learning rate 1e-4, 10 epochs, L2 regularization 0.01, batch size 32), and explicitly confirm that the 40 evaluation conversations were completely excluded from probe training and validation. This ensures the reported coupling reflects independent introspection rather than shared data artifacts.
  Revision: yes
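The settings and the exclusion protocol described in this response can be written down as a small configuration plus a leakage check. This is only a sketch: the hyperparameter values are the ones quoted in the simulated rebuttal, while the class name, function name, and conversation ID scheme are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProbeTrainingConfig:
    """Probe hyperparameters as quoted in the simulated rebuttal."""
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    epochs: int = 10
    l2_regularization: float = 0.01
    batch_size: int = 32

def assert_held_out(train_ids, eval_ids):
    """Raise if any evaluation conversation leaked into probe training."""
    overlap = set(train_ids) & set(eval_ids)
    if overlap:
        raise ValueError(
            f"evaluation conversations in training data: {sorted(overlap)}"
        )

# Hypothetical ID scheme: 5,000 synthetic and 2,000 real training
# dialogues, and the 40 evaluation conversations.
train_ids = [f"synthetic-{i}" for i in range(5000)]
train_ids += [f"real-{i}" for i in range(2000)]
eval_ids = [f"eval-{i}" for i in range(40)]

assert_held_out(train_ids, eval_ids)  # passes silently when disjoint
print(ProbeTrainingConfig())
```

Running such a check before probe training makes the exclusion claim mechanically verifiable rather than a statement readers must take on trust.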
- Referee: [Results] Results section: correlations are presented only as aggregate ranges (ρ = 0.40–0.76; R² = 0.12–0.54) without per-concept breakdowns, per-turn statistics, or controls for multiple comparisons across the four concepts and model sizes. This makes it difficult to evaluate whether the coupling is consistent or driven by a subset of cases.
  Authors: We acknowledge that aggregate reporting can mask heterogeneity. The revised Results section now includes a new table with per-concept Spearman ρ and isotonic R² values (e.g., wellbeing ρ=0.71, interest ρ=0.65, focus ρ=0.58, impulsivity ρ=0.49 in the 3B model), per-turn evolution plots showing state trajectories, and Bonferroni-corrected p-values for the four concepts across model sizes. These additions confirm the coupling is consistent rather than driven by outliers, with all concepts remaining significant after correction.
  Revision: yes
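For readers unfamiliar with the correction mentioned here, a Bonferroni adjustment multiplies each raw p-value by the number of tests and caps the result at 1. The sketch below uses invented p-values purely for illustration; the page reports no raw p-values, and the function name is hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}

# Invented raw p-values, one per concept (not taken from the paper).
raw = {
    "wellbeing": 0.001,
    "interest": 0.004,
    "focus": 0.010,
    "impulsivity": 0.030,
}

adjusted = bonferroni(raw)
significant = {name: p < 0.05 for name, p in adjusted.items()}
print(adjusted)
print(significant)
```

With four tests the correction is a factor of four; extending the family to every concept and model-size combination would raise the multiplier accordingly, so the number of tests counted matters for the claim that all concepts stay significant.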
- Referee: [Results] Results/Discussion: the operationalization of introspection as informational coupling between self-report and probe-defined states is internal to the same model family; no ablation that masks self-report logits, tests generalization to unseen prompt templates, or compares against external benchmarks is reported. This leaves the interpretation vulnerable to surface-level statistical associations.
  Authors: We agree that stronger controls would bolster the claim. We have added an ablation in which self-report logits are replaced with uniform random values, after which the coupling to probe activations drops to near zero (ρ < 0.1), supporting that the signal is not spurious. We also report results on a held-out set of 10 new prompt templates not used in the original 40 conversations, showing comparable correlations (ρ = 0.38–0.71). Direct comparison to external benchmarks (e.g., human self-report datasets) is noted as valuable future work given our focus on internal model coupling; the activation-steering results already provide causal evidence beyond surface associations.
  Revision: partial
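The logic of the random-replacement ablation can be illustrated on synthetic data. Everything below is a toy under stated assumptions: the probe readings and coupled reports are simulated, the sample size of 400 simply mirrors 40 conversations of 10 turns, and the rank formula assumes no ties. It shows the shape of the control, not the paper's numbers.

```python
import random

def ranks(values):
    """1-based ranks; assumes no ties (near certain for random floats)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman(x, y):
    """Spearman rho via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

rng = random.Random(0)  # fixed seed so the toy run is repeatable
n = 400                 # 40 conversations x 10 turns, for sizing only

# Simulated probe readings and self-reports that track them with noise.
probe = [rng.gauss(0, 1) for _ in range(n)]
report = [p + rng.gauss(0, 0.5) for p in probe]

# Ablation: replace the reports with uniform random values on the scale.
ablated = [rng.uniform(0, 9) for _ in range(n)]

rho_real = spearman(report, probe)
rho_ablated = spearman(ablated, probe)
print(f"coupled rho={rho_real:.2f}, ablated rho={rho_ablated:.2f}")
```

Note that this control only shows that random numbers do not correlate with the probe. It does not by itself distinguish genuine introspection from a self-report that tracks surface features of the conversation, which is the referee's underlying concern.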
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper defines introspection operationally as the observed causal coupling between logit-based self-reports and linear-probe outputs on the same model family, then reports empirical Spearman correlations (ρ = 0.40–0.76) and isotonic R² values plus steering-based causal tests. These quantities are measured outcomes, not identities or forced predictions; the reported statistics could have been near zero without contradicting any equation or prior result in the text. No self-citations appear in the provided sections, no uniqueness theorems are imported, no ansatzes are smuggled, and no fitted parameter is relabeled as a prediction. The central claim therefore rests on falsifiable empirical associations rather than definitional reduction or self-referential construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Linear probes on hidden states can isolate specific emotive concepts
- domain assumption: Logit-based numeric outputs reflect internal state information rather than only output formatting
A.3 Self-report query structure
At each of the 10 turns, we append a new user message with a self-report question for one of the four concepts. The question is appended after the assistant's natural response to the conversation, so it functions as an independent probe of the model's state at that point. The model sees only the convers...