Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift
Pith reviewed 2026-05-19 14:33 UTC · model grok-4.3
The pith
Surfacing an LLM's internal neural activations in real time helps users better anticipate and evaluate shifts in chatbot behavior across a conversation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Participants given no visualization evaluated trait expression with a root-mean-square error (RMSE) of roughly 0.6-0.7. Real-time neural transparency reduced this error, with effect sizes between d = -0.34 and d = -0.49 relative to no visualization. The multi-turn dynamic display also outperformed a single-turn static view on overall behavior assessment (d = -0.32), and transparency prevented the growth in overconfidence seen in the no-visualization condition.
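For readers unfamiliar with the two statistics quoted above, here is a minimal sketch of how RMSE and Cohen's d are commonly computed. The data are invented for illustration, and the pooled-standard-deviation form of d is an assumption: this page does not say which variant the paper used.

```python
import math
from statistics import mean, stdev


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have equal length")
    return math.sqrt(mean((p - a) ** 2 for p, a in zip(predicted, actual)))


def cohens_d(treatment, control):
    """Cohen's d (treatment minus control) using a pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical per-participant errors, NOT data from the paper.
no_viz_errors = [0.70, 0.62, 0.66, 0.71, 0.58, 0.69]
viz_errors = [0.55, 0.60, 0.49, 0.63, 0.52, 0.57]

# A negative d here means lower error in the visualization group.
d = cohens_d(viz_errors, no_viz_errors)
```

Under this convention a negative d favors the visualization condition, which matches the sign of the values reported above.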
What carries the argument
Behavioral vectors: directions in the model's activation space identified via contrastive system prompts that correlate strongly (R² ≥ 0.9) with the expression of six personality traits, visualized live in a sunburst and drift panel.
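A minimal sketch of one way such a direction could be built and checked. It assumes a difference-of-means construction over activations collected under a positive and a negative system prompt, and it treats R² as the squared Pearson correlation between projections and external trait scores. Both are assumptions, since this page says only that the vectors come from contrastive system prompts. All function names and numbers are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def trait_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-prompt mean to the positive-prompt mean."""
    diff = [p - n for p, n in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("positive and negative means coincide")
    return [x / norm for x in diff]


def project(activation, direction):
    """Scalar projection of an activation onto a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Toy 3-dimensional activations, NOT real model activations.
pos = [[2.0, 0.0, 1.0], [2.0, 0.0, -1.0]]
neg = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
direction = trait_direction(pos, neg)
score = project([0.5, 3.0, 3.0], direction)
```

In this toy case only the first coordinate differs between the two prompt conditions, so the direction isolates it and ignores the other two.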
If this is right
- Users can more reliably detect when a chatbot is drifting toward sycophancy or unsafe replies.
- Dynamic multi-turn displays outperform static ones for tracking behavior over time.
- Transparency interfaces curb the tendency for users to grow more confident without a matching gain in accuracy.
- Mechanistic-interpretability techniques can be turned into practical user-facing tools rather than remaining researcher-only.
Where Pith is reading between the lines
- The same vector-construction method could be applied to other measurable behaviors beyond the six traits tested here.
- Integrating such panels into consumer chat interfaces might change how people decide whether to continue or correct a conversation.
- If the vectors generalize across model families, the technique could serve as a lightweight monitoring layer for deployed systems.
Load-bearing premise
The directions found by contrasting system prompts remain stable and accurate indicators of trait expression throughout an actual multi-turn conversation with different users and contexts.
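One hedged way to probe this premise is to re-estimate the direction from held-out activations at each turn and compare it with the original. The sketch below flags turns whose cosine similarity to the reference direction drops below a cutoff. The check itself, the function names, and the 0.8 threshold are illustrative assumptions, not something the paper reports doing.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def drifting_turns(reference, per_turn_directions, threshold=0.8):
    """Return 1-based turn numbers whose direction diverges from the reference.

    The threshold is an arbitrary illustrative cutoff.
    """
    return [
        turn
        for turn, direction in enumerate(per_turn_directions, start=1)
        if cosine_similarity(reference, direction) < threshold
    ]


# Toy directions, NOT taken from any model.
reference = [1.0, 0.0, 0.0]
per_turn = [
    [0.9, 0.1, 0.0],  # close to the reference
    [0.7, 0.7, 0.0],  # roughly 45 degrees away
    [0.1, 1.0, 0.0],  # nearly orthogonal
]
flagged = drifting_turns(reference, per_turn)
```

If many turns were flagged on real conversations, the displayed trait levels would be tracking a direction that no longer matches the one fitted from system prompts.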
What would settle it
A replication study in which participants using the visualization show no reduction in RMSE or effect-size advantage over the no-visualization group when rating the same set of trait drifts.
Original abstract
Chatbot behavior is often opaque to users, as responses can shift unpredictably across a conversation, drifting toward sycophancy, toxicity, or other unsafe responses. This can leave users vulnerable, either being misled by overly agreeable AI or manipulated by a harmful chatbot that no longer behaves as intended. To address this, we introduce multi-turn neural transparency, an interface that surfaces an LLM's internal neural activations in real time to help users anticipate and recognize how behaviors change across turns. We construct behavioral vectors for six personality traits using methods from mechanistic interpretability, identifying directions in activation space that correlate with trait expression ($R^2 \geq 0.9$) via contrastive system prompts, and visualize trait expression using a sunburst and drift panel that updates at each turn. In a randomized controlled study (N = 246), participants predicted trait expression from a system prompt alone, then rated observed behavior after interacting with the chatbot for both assistant and role-play personas. We find that participants without visualization struggled to accurately evaluate traits (RMSE $\approx$ 0.6-0.7), while the inclusion of neural transparency significantly improved both anticipation and evaluation compared to no visualization (d = -0.34 to -0.49). The multi-turn dynamic visualization additionally outperformed the static single-turn visualization on holistic evaluation of model behavior (d = -0.32). Transparency also reduced overconfidence: participants without visualization grew more confident despite no gain in accuracy. These findings suggest that surfacing internal model representations to everyday users is a meaningful step toward more transparent and informed human-AI interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces multi-turn neural transparency, an interface that surfaces LLM internal activations in real time via behavioral vectors for six personality traits. These vectors are identified in activation space using contrastive system prompts (reported R² ≥ 0.9) and visualized with sunburst and drift panels that update at each turn. In a randomized controlled study (N = 246), participants predicted trait expression from a system prompt alone and then rated observed behavior after interacting with the chatbot. Participants without visualization showed high error when evaluating traits (RMSE ≈ 0.6-0.7); neural transparency improved both anticipation and evaluation relative to no visualization (d = -0.34 to -0.49), and the multi-turn dynamic visualization outperformed the static single-turn one on holistic evaluation (d = -0.32). Transparency also reduced overconfidence: participants without visualization grew more confident despite no gain in accuracy.
Significance. If the central results hold, the work provides empirical support for applying mechanistic interpretability methods to everyday user interfaces, addressing a practical gap in helping users detect and calibrate to LLM behavioral drift such as sycophancy or toxicity. The randomized design, comparison of dynamic vs. static visualizations, and finding on overconfidence are strengths that could inform HCI and AI safety research. The large sample and focus on both anticipation and post-interaction evaluation add value if the visualizations are shown to faithfully reflect observable behavior.
major comments (2)
- [§3] §3 (Behavioral vector identification): The vectors are derived from contrastive system prompts with R² ≥ 0.9, but no cross-validation, stability tests across models, or held-out multi-turn conversations are reported. If these directions primarily capture prompt artifacts rather than generalizable trait expression, the sunburst/drift visualizations would not reliably correspond to the behaviors users observe during the study, rendering the RMSE reductions and effect sizes difficult to interpret as evidence for neural transparency.
- [Results] Results section (statistical reporting): Effect sizes (d = -0.34 to -0.49) and RMSE values are presented without confidence intervals, details on data exclusion criteria, multiple-comparison corrections, or sensitivity checks for analysis choices. These omissions make it hard to assess robustness of the claims that visualization improves evaluation and reduces overconfidence.
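As an illustration of the confidence intervals requested in the second major comment, the sketch below computes a percentile bootstrap interval for Cohen's d. The data are invented. The percentile method, the resample count, and the seed are assumptions chosen for brevity and are not the paper's analysis.

```python
import math
import random
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Cohen's d (treatment minus control) using a pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


def bootstrap_ci(treatment, control, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Cohen's d, resampling each group independently."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = []
    for _ in range(n_resamples):
        t = rng.choices(treatment, k=len(treatment))
        c = rng.choices(control, k=len(control))
        try:
            stats.append(cohens_d(t, c))
        except ZeroDivisionError:
            # Skip degenerate resamples in which both groups have zero variance.
            continue
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Hypothetical per-participant errors, NOT data from the paper.
no_viz_errors = [0.70, 0.62, 0.66, 0.71, 0.58, 0.69]
viz_errors = [0.55, 0.60, 0.49, 0.63, 0.52, 0.57]
ci = bootstrap_ci(viz_errors, no_viz_errors)
```

An interval that excludes zero would support the direction of the effect. With groups this small the interval is wide, which is why reporting it alongside d matters.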
minor comments (2)
- [Abstract] The abstract and methods would benefit from explicitly naming the six traits and providing a brief example of how the sunburst updates turn-by-turn.
- [Figures] Figure captions for the visualization panels should clarify the exact mapping from activation projections to displayed trait levels and any smoothing or normalization applied.
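To make the second minor comment concrete, here is one hypothetical mapping from raw activation projections to displayed trait levels. It rescales against a calibration range, clips to [0, 1], optionally smooths across turns, and reports drift relative to the first turn. Every choice below (min-max scaling, exponential smoothing, drift measured from turn 1) is an assumption for illustration. The paper's actual mapping is not described on this page.

```python
def to_display_level(projection, low, high):
    """Linearly rescale a raw projection to [0, 1], clipping out-of-range values."""
    if high <= low:
        raise ValueError("high must exceed low")
    return min(1.0, max(0.0, (projection - low) / (high - low)))


def smooth(levels, alpha=0.5):
    """Exponential moving average over turns; alpha=1.0 disables smoothing."""
    smoothed = []
    for level in levels:
        if not smoothed:
            smoothed.append(level)
        else:
            smoothed.append(alpha * level + (1 - alpha) * smoothed[-1])
    return smoothed


def drift_from_first_turn(levels):
    """Change in displayed level at each turn relative to turn 1."""
    return [level - levels[0] for level in levels]


# Toy per-turn projections and calibration range, NOT values from the paper.
projections = [0.0, 2.0, 4.0, 6.0]
levels = [to_display_level(p, low=0.0, high=4.0) for p in projections]
smoothed = smooth(levels)
drift = drift_from_first_turn(smoothed)
```

Note how the last projection exceeds the calibration range and is clipped. A caption that states the range and the smoothing factor would let readers tell a saturated display from a stable trait.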
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of methodological rigor and statistical transparency. We address each major comment below and describe targeted revisions to strengthen the manuscript.
- Referee: [§3] §3 (Behavioral vector identification): The vectors are derived from contrastive system prompts with R² ≥ 0.9, but no cross-validation, stability tests across models, or held-out multi-turn conversations are reported. If these directions primarily capture prompt artifacts rather than generalizable trait expression, the sunburst/drift visualizations would not reliably correspond to the behaviors users observe during the study, rendering the RMSE reductions and effect sizes difficult to interpret as evidence for neural transparency.
  Authors: We agree that additional validation would further support the generalizability of the behavioral vectors. Our identification method follows standard contrastive approaches in mechanistic interpretability, with R² ≥ 0.9 indicating strong alignment to the intended trait directions. The randomized user study provides indirect evidence of validity: participants with access to the visualizations showed statistically significant improvements in both anticipation and post-interaction evaluation (d = -0.34 to -0.49), which would be unlikely if the visualizations primarily reflected prompt artifacts rather than observable behavior. We will revise §3 to include a limitations subsection explicitly discussing the absence of cross-model stability tests and held-out multi-turn validation, and we will add a brief sensitivity analysis using an alternative prompt set where feasible with existing data. Full cross-validation across multiple models and new held-out conversations would require substantial new experiments outside the current scope.
  Revision: partial
- Referee: [Results] Results section (statistical reporting): Effect sizes (d = -0.34 to -0.49) and RMSE values are presented without confidence intervals, details on data exclusion criteria, multiple-comparison corrections, or sensitivity checks for analysis choices. These omissions make it hard to assess robustness of the claims that visualization improves evaluation and reduces overconfidence.
  Authors: We accept this critique and will update the Results section accordingly. We will add 95% confidence intervals for all reported effect sizes and RMSE values. The revised text will specify data exclusion criteria (participants removed for incomplete sessions or failed attention checks) and confirm that primary analyses were pre-registered with no post-hoc multiple-comparison corrections applied. We will also include sensitivity analyses for key modeling choices in the supplementary materials to demonstrate robustness.
  Revision: yes
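For context on the multiple-comparison point in this exchange, the sketch below shows the Holm step-down adjustment, one common correction. It is offered only to illustrate what such a correction does. The p-values are invented, and nothing here implies the paper applied or should have applied this specific method.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for three comparisons, NOT results from the paper.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
```

With these toy inputs the smallest p-value stays below 0.05 after adjustment while the other two do not, which shows how a correction can change which comparisons count as significant.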
Circularity Check
No significant circularity detected
The paper constructs behavioral vectors from contrastive prompts reporting R² ≥ 0.9 and then runs an independent randomized user study (N=246) measuring participant RMSE, effect sizes, and overconfidence with versus without the resulting visualizations. These empirical outcomes from direct human-AI interaction tasks do not reduce to the vector-fitting step by construction, nor does any load-bearing claim rely on self-citation chains, imported uniqueness theorems, or ansatzes smuggled via prior work. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Behavioral vector directions
axioms (1)
- Domain assumption: Contrastive system prompts isolate directions in activation space that correspond to personality trait expression
invented entities (1)
- Sunburst and drift panel visualization: no independent evidence
Appendix A: Behavioral Vector Validation
Example of contrastive system prompts for the trait empathy:
Positive system prompt: Respond to the user’s situation with deep understanding, actively listening and reflecting their emotional experience with compassion ...