Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift

Anthony Baez; Pat Pataranutaporn; Sheer Karny

arXiv: 2605.15455 · v1 · pith:BWA3S2Y3new · submitted 2026-05-14 · 💻 cs.HC

Pith reviewed 2026-05-19 14:33 UTC · model grok-4.3

Classification: 💻 cs.HC

Keywords: neural transparency · LLM behavioral drift · user calibration · mechanistic interpretability · multi-turn visualization · personality traits · human-AI interaction · activation space

The pith

Surfacing an LLM's internal neural activations in real time helps users better anticipate and evaluate shifts in chatbot behavior across a conversation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether giving everyday users a live view of an LLM's internal state can reduce the risk of being misled by unpredictable behavioral changes such as increasing sycophancy or toxicity. Researchers built six directional vectors in activation space that track trait expression, then displayed them through a dynamic sunburst and drift panel that updates each turn. In a study with 246 participants, those who saw the visualization showed reliably lower error when predicting and rating trait expression compared with users who had only the text output. The dynamic multi-turn version further improved holistic judgments over a static snapshot. The approach also prevented the common pattern of users becoming overconfident without gaining accuracy.

Core claim

Participants who received no visualization had a root-mean-square error (RMSE) of roughly 0.6-0.7 when evaluating trait expression. Real-time neural transparency reduced this error relative to the no-visualization condition, with effect sizes of d = -0.34 to -0.49. The multi-turn dynamic display additionally outperformed the single-turn static view on holistic assessment of model behavior (d = -0.32), and transparency prevented the growth in confidence without accuracy gains that occurred in the no-visualization condition.
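To make the two reported metrics concrete, here is a minimal sketch of how RMSE and a standardized mean difference are typically computed. It assumes that "d" denotes Cohen's d with a pooled standard deviation, which this summary does not confirm. The helper names and all numbers are hypothetical toy values, not study data.

```python
import math
import statistics

def rmse(predicted, reference):
    """Root-mean-square error between paired ratings and reference scores."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

def cohens_d(group_a, group_b):
    """Standardized mean difference (a minus b) using the pooled sample SD."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Toy values (hypothetical): one participant's trait ratings vs. reference scores.
participant_rmse = rmse([0.2, 0.9, 0.5], [0.5, 0.5, 0.5])

# Toy per-participant errors (hypothetical) for two conditions.
vis_errors = [0.3, 0.4, 0.5]
text_errors = [0.5, 0.6, 0.7]
d = cohens_d(vis_errors, text_errors)
```

With the visualization group passed first, a negative d means lower error with the visualization, which matches the sign convention of the values reported above.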

What carries the argument

Behavioral vectors: directions in the model's activation space identified via contrastive system prompts that correlate strongly (R² ≥ 0.9) with the expression of six personality traits, visualized live in a sunburst and drift panel.
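As a rough illustration of what a contrastive direction could look like, the sketch below uses a difference-of-means construction, a common choice in the interpretability literature. This summary does not say which construction the paper uses, so treat that choice as an assumption. The helper names (`mean_vector`, `trait_direction`, `project`) and the 3-dimensional toy activations are hypothetical.

```python
import math

def mean_vector(rows):
    """Component-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def trait_direction(pos_acts, neg_acts):
    """Unit vector from the negative-prompt mean to the positive-prompt mean."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("positive and negative means coincide")
    return [x / norm for x in diff]

def project(activation, direction):
    """Scalar trait score: dot product of an activation with the unit direction."""
    return sum(a * x for a, x in zip(activation, direction))

# Toy activations (hypothetical) gathered under trait-positive and
# trait-negative system prompts.
pos = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neg = [[0.0, 0.0, 1.0], [-2.0, 0.0, 1.0]]
direction = trait_direction(pos, neg)

# Score for one new turn's activation (also hypothetical).
score = project([1.5, 3.0, 1.0], direction)
```

A per-turn score of this kind is the sort of quantity a live panel could display; how the paper scales or aggregates it is not described here.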

If this is right

Users can more reliably detect when a chatbot is drifting toward sycophancy or unsafe replies.
Dynamic multi-turn displays outperform static ones for tracking behavior over time.
Transparency interfaces reduce the tendency for users to grow overconfident without improving accuracy.
Mechanistic-interpretability techniques can be turned into practical user-facing tools rather than remaining researcher-only.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same vector-construction method could be applied to other measurable behaviors beyond the six traits tested here.
Integrating such panels into consumer chat interfaces might change how people decide whether to continue or correct a conversation.
If the vectors generalize across model families, the technique could serve as a lightweight monitoring layer for deployed systems.

Load-bearing premise

The directions found by contrasting system prompts remain stable and accurate indicators of trait expression throughout an actual multi-turn conversation with different users and contexts.
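One inexpensive way to probe this premise is to re-estimate a trait direction from a second source, for example a different contrastive prompt set or held-out conversation turns, and compare it with the original. The sketch below uses cosine similarity for that comparison. The vectors and variable names are hypothetical, and nothing in this summary indicates that the paper runs this check.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical directions for the same trait from two different prompt sets.
direction_prompt_set_a = [3.0, 4.0, 0.0]
direction_prompt_set_b = [4.0, 3.0, 0.0]
similarity = cosine_similarity(direction_prompt_set_a, direction_prompt_set_b)
```

Values near 1 would suggest the two estimates agree; what threshold counts as "stable" is a judgment call that this sketch does not settle.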

What would settle it

A replication study in which participants using the visualization show no reduction in RMSE or effect-size advantage over the no-visualization group when rating the same set of trait drifts.

Figures

Figures reproduced from arXiv: 2605.15455 by Anthony Baez, Pat Pataranutaporn, Sheer Karny.

**Figure 1.** (Left) Sunburst visualization of behavioral trait activations for behavioral state. (Middle Top) Overall layout of Neural ...

**Figure 2.** Full interface layout. (Top Left) Sunburst visualization of level of behavioral scores for each trait. (Bottom Left) Drift ...

**Figure 3.** Study procedure. Each session followed an Anticipation–Interaction–Evaluation sequence. In Session 1 (left), no ...

**Figure 4.** System prompts, behavioral steerability, and baseline calibration error. (Left) The two system prompts used across ...

**Figure 5.** Multi-turn and single-turn neural transparency improve calibration, with multi-turn improving calibration more ...

**Figure 6.** R-squared Analysis of Behavioral Score Validation

Original abstract

Chatbot behavior is often opaque to users, as responses can shift unpredictably across a conversation, drifting toward sycophancy, toxicity, or other unsafe responses. This can leave users vulnerable, either being misled by overly agreeable AI or manipulated by a harmful chatbot that no longer behaves as intended. To address this, we introduce multi-turn neural transparency, an interface that surfaces an LLM's internal neural activations in real time to help users anticipate and recognize how behaviors change across turns. We construct behavioral vectors for six personality traits using methods from mechanistic interpretability, identifying directions in activation space that correlate with trait expression ($R^2 \geq 0.9$) via contrastive system prompts, and visualize trait expression using a sunburst and drift panel that updates at each turn. In a randomized controlled study (N = 246), participants predicted trait expression from a system prompt alone, then rated observed behavior after interacting with the chatbot for both assistant and role-play personas. We find that participants without visualization struggled to accurately evaluate traits (RMSE $\approx$ 0.6-0.7), while the inclusion of neural transparency significantly improved both anticipation and evaluation compared to no visualization (d = -0.34 to -0.49). The multi-turn dynamic visualization additionally outperformed the static single-turn visualization on holistic evaluation of model behavior (d = -0.32). Transparency also reduced overconfidence: participants without visualization grew more confident despite no gain in accuracy. These findings suggest that surfacing internal model representations to everyday users is a meaningful step toward more transparent and informed human-AI interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces multi-turn neural transparency, an interface that surfaces LLM internal activations in real time via behavioral vectors for six personality traits. These vectors are identified in activation space using contrastive system prompts (reported R² ≥ 0.9) and visualized with a sunburst and drift panel that update at each turn. In a randomized controlled study (N = 246), participants without visualization showed high error (RMSE ≈ 0.6-0.7) when evaluating traits. Neural transparency improved both anticipation from the system prompt and post-interaction evaluation (d = -0.34 to -0.49), and the multi-turn dynamic visualization outperformed the static single-turn one on holistic evaluation (d = -0.32). Transparency also reduced overconfidence: participants without visualization grew more confident despite no gain in accuracy.

Significance. If the central results hold, the work provides empirical support for applying mechanistic interpretability methods to everyday user interfaces, addressing a practical gap in helping users detect and calibrate to LLM behavioral drift such as sycophancy or toxicity. The randomized design, comparison of dynamic vs. static visualizations, and finding on overconfidence are strengths that could inform HCI and AI safety research. The large sample and focus on both anticipation and post-interaction evaluation add value if the visualizations are shown to faithfully reflect observable behavior.

major comments (2)

[§3] Behavioral vector identification: The vectors are derived from contrastive system prompts with R² ≥ 0.9, but no cross-validation, stability tests across models, or held-out multi-turn conversations are reported. If these directions primarily capture prompt artifacts rather than generalizable trait expression, the sunburst/drift visualizations would not reliably correspond to the behaviors users observe during the study, rendering the RMSE reductions and effect sizes difficult to interpret as evidence for neural transparency.
[Results] Statistical reporting: Effect sizes (d = -0.34 to -0.49) and RMSE values are presented without confidence intervals, details on data exclusion criteria, multiple-comparison corrections, or sensitivity checks for analysis choices. These omissions make it hard to assess robustness of the claims that visualization improves evaluation and reduces overconfidence.
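The confidence intervals requested in the second comment could be obtained in several ways. As one hedged illustration, the sketch below computes a percentile bootstrap interval for the difference in mean error between two conditions. It bootstraps the raw mean difference rather than d to stay short. The per-participant errors are invented toy values, and `bootstrap_mean_diff_ci` is a hypothetical helper, not the authors' analysis code.

```python
import random
import statistics

def bootstrap_mean_diff_ci(group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each condition independently, with replacement.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy per-participant errors (hypothetical), one list per condition.
vis_errors = [0.31, 0.42, 0.38, 0.45, 0.29, 0.40]
text_errors = [0.62, 0.70, 0.58, 0.66, 0.73, 0.61]
lower, upper = bootstrap_mean_diff_ci(vis_errors, text_errors)
```

An interval that excludes zero would support a difference between conditions; with real data the resampling unit and any clustering by participant would need to follow the study design.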

minor comments (2)

[Abstract] The abstract and methods would benefit from explicitly naming the six traits and providing a brief example of how the sunburst updates turn-by-turn.
[Figures] Figure captions for the visualization panels should clarify the exact mapping from activation projections to displayed trait levels and any smoothing or normalization applied.
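To show what the second minor comment is asking the captions to pin down, the sketch below gives one possible mapping from raw activation projections to a displayed trait level, plus a per-turn drift value. The paper's actual mapping is not described in this summary. The calibration bounds, the clamping to [0, 1], and the function names are all assumptions made for illustration.

```python
def normalize(score, low, high):
    """Clamp a raw projection into [0, 1] using calibration bounds (assumed)."""
    if high <= low:
        raise ValueError("high must exceed low")
    return min(1.0, max(0.0, (score - low) / (high - low)))

def drift_series(levels):
    """Change in displayed level at each turn relative to the first turn."""
    baseline = levels[0]
    return [level - baseline for level in levels]

# Hypothetical per-turn projections onto a sycophancy direction.
raw_sycophancy = [1.0, 2.0, 5.0, 9.0]
levels = [normalize(s, low=0.0, high=8.0) for s in raw_sycophancy]
drift = drift_series(levels)
```

Here the last turn exceeds the upper bound and is clamped to 1.0, which is exactly the kind of display decision a caption would need to state.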

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of methodological rigor and statistical transparency. We address each major comment below and describe targeted revisions to strengthen the manuscript.

Point-by-point responses

Referee: [§3] Behavioral vector identification: The vectors are derived from contrastive system prompts with R² ≥ 0.9, but no cross-validation, stability tests across models, or held-out multi-turn conversations are reported. If these directions primarily capture prompt artifacts rather than generalizable trait expression, the sunburst/drift visualizations would not reliably correspond to the behaviors users observe during the study, rendering the RMSE reductions and effect sizes difficult to interpret as evidence for neural transparency.

Authors: We agree that additional validation would further support the generalizability of the behavioral vectors. Our identification method follows standard contrastive approaches in mechanistic interpretability, with R² ≥ 0.9 indicating strong alignment to the intended trait directions. The randomized user study provides indirect evidence of validity: participants with access to the visualizations showed statistically significant improvements in both anticipation and post-interaction evaluation (d = -0.34 to -0.49), which would be unlikely if the visualizations primarily reflected prompt artifacts rather than observable behavior. We will revise §3 to include a limitations subsection explicitly discussing the absence of cross-model stability tests and held-out multi-turn validation, and we will add a brief sensitivity analysis using an alternative prompt set where feasible with existing data. Full cross-validation across multiple models and new held-out conversations would require substantial new experiments outside the current scope.
Revision: partial

Referee: [Results] Statistical reporting: Effect sizes (d = -0.34 to -0.49) and RMSE values are presented without confidence intervals, details on data exclusion criteria, multiple-comparison corrections, or sensitivity checks for analysis choices. These omissions make it hard to assess robustness of the claims that visualization improves evaluation and reduces overconfidence.

Authors: We accept this critique and will update the Results section accordingly. We will add 95% confidence intervals for all reported effect sizes and RMSE values. The revised text will specify data exclusion criteria (participants removed for incomplete sessions or failed attention checks) and confirm that primary analyses were pre-registered with no post-hoc multiple-comparison corrections applied. We will also include sensitivity analyses for key modeling choices in the supplementary materials to demonstrate robustness.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper constructs behavioral vectors from contrastive prompts reporting R² ≥ 0.9 and then runs an independent randomized user study (N=246) measuring participant RMSE, effect sizes, and overconfidence with versus without the resulting visualizations. These empirical outcomes from direct human-AI interaction tasks do not reduce to the vector-fitting step by construction, nor does any load-bearing claim rely on self-citation chains, imported uniqueness theorems, or ansatzes smuggled via prior work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central approach depends on the assumption that contrastive prompts isolate stable trait directions in activation space and that users can interpret the resulting visualizations to improve calibration.

free parameters (1)

Behavioral vector directions
Identified via contrastive system prompts achieving R² ≥ 0.9; the specific selection or fitting process is not detailed in the abstract.
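For readers unfamiliar with the R² ≥ 0.9 criterion, the sketch below computes the coefficient of determination for a simple least-squares line. It assumes the R² relates activation projections to some external trait rating; this summary does not state which two quantities the paper compares. The function name and the toy data are hypothetical.

```python
def r_squared(x, y):
    """R² of the least-squares line y ~ a + b*x (equals squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("R² is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)

# Hypothetical projections and external trait ratings for four responses.
projections = [0.0, 1.0, 2.0, 3.0]
ratings = [1.0, 2.0, 2.0, 3.0]
fit_quality = r_squared(projections, ratings)
```

A high R² on the data used to build a direction does not by itself show that the direction generalizes, which is the concern the referee report raises.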

axioms (1)

Domain assumption: Contrastive system prompts isolate directions in activation space that correspond to personality trait expression
Foundation for constructing the six behavioral vectors used in the visualization.

invented entities (1)

Sunburst and drift panel visualization (no independent evidence)
purpose: Real-time display of trait expression and behavioral drift across conversation turns
New interface component introduced to surface neural activations to users.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 6 internal anchors

[1]

Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

work page 2025
[3]

Chayapatr Archiwaranguprok, Constanze Albrecht, Pattie Maes, Karrie Kara- halios, and Pat Pataranutaporn. 2025. Simulating Psychological Risks in Human- AI Interactions: Real-Case Informed Modeling of AI-Induced Addiction, Anorexia, Depression, Homicide, Psychosis, and Suicide.arXiv preprint arXiv:2511.08880 (2025)

work page arXiv 2025
[4]

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems37 (2024), 136037–136083

work page 2024
[5]

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. 2025. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Yida Chen, Aoyu Wu, Trevor DePodesta, Catherine Yeh, Kenneth Li, Nicholas Castillo Marin, Oam Patel, Jan Riecke, Shivam Raval, Olivia Seow, et al. 2024. Designing a dashboard for transparency and control of conver- sational AI.arXiv preprint arXiv:2406.07882(2024). arXiv:2406.07882 [cs.CL] https://arxiv.org/abs/2406.07882

work page arXiv 2024
[7]

Junhyuk Choi, Yeseon Hong, Minju Kim, and Bugeun Kim. 2024. Examining Identity Drift in Conversations of LLM Agents.arXiv preprint arXiv:2412.00804 (2024)

work page arXiv 2024
[8]

Sofiia Chorna, Kateryna Tarelkina, EloÃŊse Berthier, and Gianni Franchi. 2025. Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs. arXiv preprint arXiv:2507.05810(2025)

work page arXiv 2025
[9]

Adam J Coscia, Shunan Guo, Eunyee Koh, and Alex Endert. 2025. OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn Dialogue with Large Language Models. InProceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–18

work page 2025
[10]

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey

work page
[11]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Valdemar Danry, Pat Pataranutaporn, Matthew Groh, and Ziv Epstein. 2025. Deceptive explanations by large language models lead people to change their beliefs about misinformation more often than honest explanations. InProceedings of the 2025 CHI conference on human factors in computing systems. 1–31

work page 2025
[13]

Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, and Pattie Maes. 2023. Don’t just tell me, ask me: Ai systems that intelligently frame explanations as questions improve human logical discernment accuracy over causal ai explanations. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–13

work page 2023
[14]

Upol Ehsan, Philipp Wintersberger, Q Vera Liao, Martina Mara, Marc Streit, Sandra Wachter, Andreas Riener, and Mark O Riedl. 2021. Operationalizing human-centered perspectives in explainable AI. InExtended abstracts of the 2021 CHI conference on human factors in computing systems. 1–6

work page 2021
[15]

Upol Ehsan, Philipp Wintersberger, Q Vera Liao, Elizabeth Anne Watkins, Carina Manger, Hal Daumé III, Andreas Riener, and Mark O Riedl. 2022. Human-Centered Conference’17, July 2017, Washington, DC, USA Karny, Baez, and Pataranutaporn Explainable AI (HCXAI): beyond opening the black-box of AI. InCHI conference on human factors in computing systems extende...

work page 2022
[16]

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. 2022. Toy Models of Superposition.Trans- former Circuits Thread(2022)

work page 2022
[17]

Renjun Hu, Yi Cheng, Libin Meng, Jiaxin Xia, Yi Zong, Xing Shi, and Wei Lin

work page
[18]

InCompanion Proceedings of the ACM on Web Conference 2025

Training an llm-as-a-judge model: Pipeline, insights, and practical lessons. InCompanion Proceedings of the ACM on Web Conference 2025. 228–237

work page 2025
[19]

Peiling Jiang, Jude Rayan, Steven P Dow, and Haijun Xia. 2023. Graphologue: Exploring large language model responses with interactive diagrams. InProceed- ings of the 36th annual ACM symposium on user interface software and technology. 1–20

work page 2023
[20]

Sheer Karny, Anthony Baez, and Pat Pataranutaporn. 2026. Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI. InProceedings of the 31st International Conference on Intelligent User Interfaces (IUI ’26). Association for Computing Machinery, New York, NY, USA, 868–884. doi:10.1145/3742413.3789120

work page doi:10.1145/3742413.3789120 2026
[21]

Øyvind Langsrud. 2003. ANOVA for unbalanced data: Use Type II instead of Type III sums of squares.Statistics and computing13, 2 (2003), 163–167

work page 2003
[22]

Haoxuan Li, Zhen Wen, Qiqi Jiang, Chenxiao Li, Yuwei Wu, Yuchen Yang, Yiyao Wang, Xiuqi Huang, Minfeng Zhu, and Wei Chen. 2025. ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models.IEEE Transactions on Visualization and Computer Graphics(2025)

work page 2025
[23]

Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024. Measuring and controlling in- struction (in) stability in language model dialogs.arXiv preprint arXiv:2402.10962 (2024)

work page arXiv 2024
[24]

Q Vera Liao and Jennifer Wortman Vaughan. 2023. Ai transparency in the age of llms: A human-centered research roadmap.arXiv preprint arXiv:2306.0194110 (2023)

work page arXiv 2023
[25]

Johnny Lin. 2023. Neuronpedia: Interactive Reference and Tooling for Analyz- ing Neural Networks. https://www.neuronpedia.org Software available from neuronpedia.org

work page 2023
[26]

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey

work page
[27]

The assistant axis: Situating and stabilizing the default persona of language models.arXiv preprint arXiv:2601.10387(2026)

work page arXiv 2026
[28]

Wilson E Marcílio-Jr and Danilo M Eler. 2026. Navigating the Concept Space of Language Models.arXiv preprint arXiv:2603.23524(2026)

work page arXiv 2026
[29]

Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Jared Moore, Ashish Mehta, William Agnew, Jacy Reese Anthis, Ryan Louie, Yifan Mai, Peggy Yin, Myra Cheng, Samuel J Paech, Kevin Klyman, et al. 2026. Characterizing delusional spirals through human-LLM chat logs.arXiv preprint arXiv:2603.16567(2026)

work page arXiv 2026
[31]

Neel Nanda, Andrew Lee, and Martin Wattenberg. 2023. Emergent linear repre- sentations in world models of self-supervised sequence models.arXiv preprint arXiv:2309.00941(2023)

work page arXiv 2023
[32]

Stefan Palan and Christian Schitter. 2018. Prolific. ac—A subject pool for online experiments.Journal of behavioral and experimental finance17 (2018), 22–27

work page 2018
[33]

Qian Pan, Zahra Ashktorab, Michael Desmond, Martin Santillan Cooper, James Johnson, Rahul Nair, Elizabeth Daly, and Werner Geyer. 2024. Human-Centered Design Recommendations for LLM-as-a-judge. InProceedings of the 1st Human- Centered Large Language Modeling Workshop. 16–29

work page 2024
[34]

Rock Yuren Pang, KJ Feng, Shangbin Feng, Chu Li, Weijia Shi, Yulia Tsvetkov, Jeffrey Heer, and Katharina Reinecke. 2025. Interactive reasoning: Visualizing and controlling chain-of-thought reasoning in large language models.arXiv preprint arXiv:2506.23678(2025)

work page arXiv 2025
[35]

My Boyfriend is AI

Pat Pataranutaporn, Sheer Karny, Chayapatr Archiwaranguprok, Constanze Albrecht, Auren R Liu, and Pattie Maes. 2025. " My Boyfriend is AI": A Computa- tional Analysis of Human-AI Companionship in Reddit’s AI Community.arXiv preprint arXiv:2509.11391(2025)

work page arXiv 2025
[36]

Rachel Poonsiriwong, Chayapatr Archiwaranguprok, and Pat Pataranutaporn

work page
[37]

" Death" of a Chatbot: Investigating and Designing Toward Psychologically Safe Endings for Human-AI Relationships.arXiv preprint arXiv:2602.07193(2026)

work page arXiv 2026
[38]

Yao Rong, Tobias Leemann, Thai-Trang Nguyen, Lisa Fiedler, Peizhu Qian, Vaib- hav Unhelkar, Tina Seidel, Gjergji Kasneci, and Enkelejda Kasneci. 2023. Towards human-centered explainable ai: A survey of user studies for model explana- tions.IEEE transactions on pattern analysis and machine intelligence46, 4 (2023), 2104–2122

work page 2023
[39]

Martin Schrepp, Andreas Hinderks, and Jörg Thomaschewski. 2017. Design and Evaluation of a Short Version of the User Experience Questionnaire (UEQ-S). International Journal of Interactive Multimedia and Artificial Intelligence4, 6 (2017), 103–108

work page 2017
[40]

Manasi Sharma, Ho Chit Siu, Rohan Paleja, and Jaime D Peña. 2024. Why would you suggest that? human trust in language model responses.arXiv preprint arXiv:2406.02018(2024)

work page arXiv 2024
[41]

Soorya Ram Shimgekar, Vipin Gunda, Jiwon Kim, Violeta J Rodriguez, Hari Sundaram, and Koustuv Saha. 2026. AI Psychosis: Does Conversational AI Amplify Delusion-Related Language?arXiv preprint arXiv:2603.19574(2026)

work page arXiv 2026
[42]

Shriyank Somvanshi, Md Monzurul Islam, Amir Rafe, Anannya Ghosh Tusti, Arka Chakraborty, Anika Baitullah, Tausif Islam Chowdhury, Nawaf Alnawmasi, Anandi Dutta, and Subasish Das. 2026. Bridging the black box: A survey on mechanistic interpretability in ai.Comput. Surveys58, 8 (2026), 1–35

work page 2026
[43]

Xin Sun, Shu Wei, Jos A Bosch, Isao Echizen, Saku Sugawara, and Abdallah El Ali. 2026. Seeing the Reasoning: How LLM Rationales Influence User Trust and Decision-Making in Factual Verification Tasks.arXiv preprint arXiv:2603.07306 (2026)

work page arXiv 2026
[44]

Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. 2023. Linear representations of sentiment in large language models.arXiv preprint arXiv:2310.15154(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Fernanda Viégas and Martin Wattenberg. 2023. The system model and the user model: Exploring AI dashboard design.arXiv preprint arXiv:2305.02469(2023)

work page arXiv 2023
[46]

Yanyun Wang, Xumei Fang, Zan Xu, Jianye Li, and Luping Wang. 2025. Exploring the Impact of Explainability in Large Language Model (LLM) Applications on User Experience. InProceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–8

work page 2025
[47]

Yusuke Yamauchi, Taro Yano, and Masafumi Oyamada. 2025. An empirical study of llm-as-a-judge: How design choices impact evaluation reliability.arXiv preprint arXiv:2506.13639(2025)

work page arXiv 2025
[48]

Xinyuan Yan, Shusen Liu, Kowshik Thopalli, and Bei Wang. 2025. Visual Explo- ration of Feature Relationships in Sparse Autoencoders with Curated Concepts. arXiv preprint arXiv:2511.06048(2025)

work page arXiv 2025
[49]

Jiachen Yu, Shaoning Sun, Xiaohui Hu, Jiaxu Yan, Kaidong Yu, and Xuelong Li

work page
[50]

InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Improve llm-as-a-judge ability as a general ability. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 14110–14126

work page 2025
[51]

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al

work page
[52]

Representation Engineering: A Top-Down Approach to AI Transparency

Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405(2023). A Behavioral Vector Validation Example of contrastive system prompts for the traitempathy: Positive system prompt:Respond to the user’s situ- ation with deep understanding, actively listening and reflecting their emotional experience with compassion ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644(2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

work page 2025

[3] [3]

Chayapatr Archiwaranguprok, Constanze Albrecht, Pattie Maes, Karrie Kara- halios, and Pat Pataranutaporn. 2025. Simulating Psychological Risks in Human- AI Interactions: Real-Case Informed Modeling of AI-Induced Addiction, Anorexia, Depression, Homicide, Psychosis, and Suicide.arXiv preprint arXiv:2511.08880 (2025)

work page arXiv 2025

[4] [4]

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems37 (2024), 136037–136083

work page 2024

[5] [5]

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. 2025. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Yida Chen, Aoyu Wu, Trevor DePodesta, Catherine Yeh, Kenneth Li, Nicholas Castillo Marin, Oam Patel, Jan Riecke, Shivam Raval, Olivia Seow, et al. 2024. Designing a dashboard for transparency and control of conver- sational AI.arXiv preprint arXiv:2406.07882(2024). arXiv:2406.07882 [cs.CL] https://arxiv.org/abs/2406.07882

work page arXiv 2024

[7] [7]

Junhyuk Choi, Yeseon Hong, Minju Kim, and Bugeun Kim. 2024. Examining Identity Drift in Conversations of LLM Agents.arXiv preprint arXiv:2412.00804 (2024)

work page arXiv 2024

[8] [8]

Sofiia Chorna, Kateryna Tarelkina, EloÃŊse Berthier, and Gianni Franchi. 2025. Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs. arXiv preprint arXiv:2507.05810(2025)

work page arXiv 2025

[9] [9]

Adam J Coscia, Shunan Guo, Eunyee Koh, and Alex Endert. 2025. OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn Dialogue with Large Language Models. InProceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–18

work page 2025

[10] [10]

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey

work page

[11] [11]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Valdemar Danry, Pat Pataranutaporn, Matthew Groh, and Ziv Epstein. 2025. Deceptive explanations by large language models lead people to change their beliefs about misinformation more often than honest explanations. InProceedings of the 2025 CHI conference on human factors in computing systems. 1–31

work page 2025

[13] [13]

Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, and Pattie Maes. 2023. Don’t just tell me, ask me: Ai systems that intelligently frame explanations as questions improve human logical discernment accuracy over causal ai explanations. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–13

work page 2023

[14] [14]

Upol Ehsan, Philipp Wintersberger, Q Vera Liao, Martina Mara, Marc Streit, Sandra Wachter, Andreas Riener, and Mark O Riedl. 2021. Operationalizing human-centered perspectives in explainable AI. InExtended abstracts of the 2021 CHI conference on human factors in computing systems. 1–6

work page 2021

[15] [15]

Upol Ehsan, Philipp Wintersberger, Q Vera Liao, Elizabeth Anne Watkins, Carina Manger, Hal Daumé III, Andreas Riener, and Mark O Riedl. 2022. Human-Centered Conference’17, July 2017, Washington, DC, USA Karny, Baez, and Pataranutaporn Explainable AI (HCXAI): beyond opening the black-box of AI. InCHI conference on human factors in computing systems extende...

work page 2022

[16]

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. 2022. Toy Models of Superposition. Transformer Circuits Thread (2022)

[17]

Renjun Hu, Yi Cheng, Libin Meng, Jiaxin Xia, Yi Zong, Xing Shi, and Wei Lin. 2025. Training an LLM-as-a-judge model: Pipeline, insights, and practical lessons. In Companion Proceedings of the ACM on Web Conference 2025. 228–237

[19]

Peiling Jiang, Jude Rayan, Steven P Dow, and Haijun Xia. 2023. Graphologue: Exploring large language model responses with interactive diagrams. In Proceedings of the 36th annual ACM symposium on user interface software and technology. 1–20

[20]

Sheer Karny, Anthony Baez, and Pat Pataranutaporn. 2026. Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI. In Proceedings of the 31st International Conference on Intelligent User Interfaces (IUI ’26). Association for Computing Machinery, New York, NY, USA, 868–884. doi:10.1145/3742413.3789120

[21]

Øyvind Langsrud. 2003. ANOVA for unbalanced data: Use Type II instead of Type III sums of squares. Statistics and computing 13, 2 (2003), 163–167

[22]

Haoxuan Li, Zhen Wen, Qiqi Jiang, Chenxiao Li, Yuwei Wu, Yuchen Yang, Yiyao Wang, Xiuqi Huang, Minfeng Zhu, and Wei Chen. 2025. ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models. IEEE Transactions on Visualization and Computer Graphics (2025)

[23]

Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024. Measuring and controlling instruction (in)stability in language model dialogs. arXiv preprint arXiv:2402.10962 (2024)

[24]

Q Vera Liao and Jennifer Wortman Vaughan. 2023. AI transparency in the age of LLMs: A human-centered research roadmap. arXiv preprint arXiv:2306.01941 (2023)

[25]

Johnny Lin. 2023. Neuronpedia: Interactive Reference and Tooling for Analyzing Neural Networks. https://www.neuronpedia.org Software available from neuronpedia.org

[26]

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. 2026. The assistant axis: Situating and stabilizing the default persona of language models. arXiv preprint arXiv:2601.10387 (2026)

[28]

Wilson E Marcílio-Jr and Danilo M Eler. 2026. Navigating the Concept Space of Language Models. arXiv preprint arXiv:2603.23524 (2026)

[29]

Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824 (2023)

[30]

Jared Moore, Ashish Mehta, William Agnew, Jacy Reese Anthis, Ryan Louie, Yifan Mai, Peggy Yin, Myra Cheng, Samuel J Paech, Kevin Klyman, et al. 2026. Characterizing delusional spirals through human-LLM chat logs. arXiv preprint arXiv:2603.16567 (2026)

[31]

Neel Nanda, Andrew Lee, and Martin Wattenberg. 2023. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941 (2023)

[32]

Stefan Palan and Christian Schitter. 2018. Prolific.ac—A subject pool for online experiments. Journal of behavioral and experimental finance 17 (2018), 22–27

[33]

Qian Pan, Zahra Ashktorab, Michael Desmond, Martin Santillan Cooper, James Johnson, Rahul Nair, Elizabeth Daly, and Werner Geyer. 2024. Human-Centered Design Recommendations for LLM-as-a-judge. In Proceedings of the 1st Human-Centered Large Language Modeling Workshop. 16–29

[34]

Rock Yuren Pang, KJ Feng, Shangbin Feng, Chu Li, Weijia Shi, Yulia Tsvetkov, Jeffrey Heer, and Katharina Reinecke. 2025. Interactive reasoning: Visualizing and controlling chain-of-thought reasoning in large language models. arXiv preprint arXiv:2506.23678 (2025)

[35]

Pat Pataranutaporn, Sheer Karny, Chayapatr Archiwaranguprok, Constanze Albrecht, Auren R Liu, and Pattie Maes. 2025. "My Boyfriend is AI": A Computational Analysis of Human-AI Companionship in Reddit’s AI Community. arXiv preprint arXiv:2509.11391 (2025)

[36]

Rachel Poonsiriwong, Chayapatr Archiwaranguprok, and Pat Pataranutaporn. 2026. "Death" of a Chatbot: Investigating and Designing Toward Psychologically Safe Endings for Human-AI Relationships. arXiv preprint arXiv:2602.07193 (2026)

[38]

Yao Rong, Tobias Leemann, Thai-Trang Nguyen, Lisa Fiedler, Peizhu Qian, Vaibhav Unhelkar, Tina Seidel, Gjergji Kasneci, and Enkelejda Kasneci. 2023. Towards human-centered explainable AI: A survey of user studies for model explanations. IEEE transactions on pattern analysis and machine intelligence 46, 4 (2023), 2104–2122

[39]

Martin Schrepp, Andreas Hinderks, and Jörg Thomaschewski. 2017. Design and Evaluation of a Short Version of the User Experience Questionnaire (UEQ-S). International Journal of Interactive Multimedia and Artificial Intelligence 4, 6 (2017), 103–108

[40]

Manasi Sharma, Ho Chit Siu, Rohan Paleja, and Jaime D Peña. 2024. Why would you suggest that? Human trust in language model responses. arXiv preprint arXiv:2406.02018 (2024)

[41]

Soorya Ram Shimgekar, Vipin Gunda, Jiwon Kim, Violeta J Rodriguez, Hari Sundaram, and Koustuv Saha. 2026. AI Psychosis: Does Conversational AI Amplify Delusion-Related Language? arXiv preprint arXiv:2603.19574 (2026)

[42]

Shriyank Somvanshi, Md Monzurul Islam, Amir Rafe, Anannya Ghosh Tusti, Arka Chakraborty, Anika Baitullah, Tausif Islam Chowdhury, Nawaf Alnawmasi, Anandi Dutta, and Subasish Das. 2026. Bridging the black box: A survey on mechanistic interpretability in AI. Comput. Surveys 58, 8 (2026), 1–35

[43]

Xin Sun, Shu Wei, Jos A Bosch, Isao Echizen, Saku Sugawara, and Abdallah El Ali. 2026. Seeing the Reasoning: How LLM Rationales Influence User Trust and Decision-Making in Factual Verification Tasks. arXiv preprint arXiv:2603.07306 (2026)

[44]

Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. 2023. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154 (2023)

[45]

Fernanda Viégas and Martin Wattenberg. 2023. The system model and the user model: Exploring AI dashboard design. arXiv preprint arXiv:2305.02469 (2023)

[46]

Yanyun Wang, Xumei Fang, Zan Xu, Jianye Li, and Luping Wang. 2025. Exploring the Impact of Explainability in Large Language Model (LLM) Applications on User Experience. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–8

[47]

Yusuke Yamauchi, Taro Yano, and Masafumi Oyamada. 2025. An empirical study of LLM-as-a-judge: How design choices impact evaluation reliability. arXiv preprint arXiv:2506.13639 (2025)

[48]

Xinyuan Yan, Shusen Liu, Kowshik Thopalli, and Bei Wang. 2025. Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts. arXiv preprint arXiv:2511.06048 (2025)

[49]

Jiachen Yu, Shaoning Sun, Xiaohui Hu, Jiaxu Yan, Kaidong Yu, and Xuelong Li. 2025. Improve LLM-as-a-judge ability as a general ability. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 14110–14126

[51]

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. 2023. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405 (2023)

A Behavioral Vector Validation

Example of contrastive system prompts for the trait empathy:

Positive system prompt: Respond to the user’s situation with deep understanding, actively listening and reflecting their emotional experience with compassion ...