Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

Dimitrios Damianos; Georgios Paraskevopoulos; Georgios Skyrianos; Leon Voukoutis; Vassilis Katsouros

arxiv: 2605.22902 · v1 · pith:6KAZX25Bnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI· cs.CL

Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

Dimitrios Damianos , Leon Voukoutis , Georgios Skyrianos , Vassilis Katsouros , Georgios Paraskevopoulos This is my paper

Pith reviewed 2026-05-25 06:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords transcodersvision-language modelshallucinationsvisual groundingmechanistic interpretabilityMLP approximationscircuit tracingGemma

0 comments

The pith

Transcoders decompose VLMs into pathways that link image patches to token generation and predict hallucinations via circuit graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that transcoders, which approximate the updates performed inside MLP sublayers, serve as a causal proxy that decomposes vision-language model computation into interpretable pathways from image patches to generated tokens. This function-centric view yields attributions that remain more stable and stronger under patch ablation than those obtained from sparse autoencoders, and that better match semantically relevant regions of the input image. A counterfactual test with false visual grounding shows the recovered pathways are specific to cross-modal interaction rather than generic language processing. Extracting graph-based features from the resulting circuit traces then allows a simple logistic classifier to identify hallucinated outputs at an AUC of 0.68. If these results hold, mechanistic accounts of multimodal generation become available that were previously inaccessible through static representation decompositions.

Core claim

Applied to Gemma 3-4B-IT, transcoders decompose the model into interpretable computational pathways linking image patches to directions in token generation. Transcoder attributions produce stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, and align better with semantically relevant image regions. A False Visual Grounding counterfactual analysis confirms that the recovered pathways are specific to vision-language interaction. Structural analysis of hallucinated generations extracts graph-based indicators from circuit traces produced by the transcoders, enabling a logistic classifier over these mechanistic graph features to predict hallucn

What carries the argument

Transcoders as sparse approximations of MLP sublayers that act as a causal proxy for layer-wise functional updates.

If this is right

Transcoder attributions affect visually grounded tokens more strongly and stably than SAE attributions under patch ablation.
The recovered pathways align more closely with semantically relevant image regions than those from prior methods.
Counterfactual analysis with false visual grounding isolates the pathways as specific to vision-language interaction.
Graph features extracted from transcoder circuit traces support a logistic classifier that predicts hallucinations at AUC 0.68.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transcoder decomposition could be run on other VLMs to test whether similar grounding pathways appear across architectures.
Targeted interventions on the identified circuits might reduce hallucination rates without full retraining.
Real-time extraction of the graph indicators could serve as an online detector for ungrounded outputs during generation.

Load-bearing premise

Transcoders serve as a faithful causal proxy for the functional updates inside the model's MLP sublayers.

What would settle it

An experiment in which ablating patches identified by transcoder attributions fails to produce larger or more stable changes to grounded token probabilities than SAE attributions, or in which the graph-feature classifier achieves AUC no higher than random, would falsify the central claims.

Figures

Figures reproduced from arXiv: 2605.22902 by Dimitrios Damianos, Georgios Paraskevopoulos, Georgios Skyrianos, Leon Voukoutis, Vassilis Katsouros.

**Figure 1.** Figure 1: Comparison of the top − 10 most important image patches identified by SAEs and Transcoders. Transcoders identify patches that are more aligned with visually grounded tokens, as reflected in both visual correspondence and their impact on token probability and entropy. token probability and entropy. In particular, removing Transcoder-identified patches leads to a larger decrease in token probability and a la… view at source ↗

**Figure 2.** Figure 2: Circuit analysis on captions: visually grounded tokens have clear semantic links to specific [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Attribution maps in the False Visual Grounding setting. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Circuit analysis on FVG setting: Transcoders reveal no correlation between the target and [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Example of computation path: We examine both per layer and per token paths. We visualize [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Top-1 ablation comparison 13 [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Top-5 ablation comparison 14 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Top-5 ablation comparison 15 [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Top-10 ablation comparison 16 [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Top-10 ablation comparison 17 [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Circuit analysis results 18 [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Circuit analysis results 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Circuit analysis results 20 [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Top-1 FVG ablation 21 [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

**Figure 15.** Figure 15: Top-5 FVG comparison 22 [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗

**Figure 16.** Figure 16: Top-10 FVG ablation 23 [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗

**Figure 17.** Figure 17: Circuit analysis on FVG 24 [PITH_FULL_IMAGE:figures/full_fig_p024_17.png] view at source ↗

**Figure 18.** Figure 18: Circuit analysis on FVG 25 [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗

**Figure 19.** Figure 19: Circuit analysis on FVG 26 [PITH_FULL_IMAGE:figures/full_fig_p026_19.png] view at source ↗

read the original abstract

Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which decompose static residual representations and miss the functional updates that drive cross-modal interaction. We adopt a function-centric framework based on Transcoders, sparse approximations of MLP sublayers that act as a causal proxy for layer-wise computation. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable computational pathways linking image patches to directions in token generation. Transcoder attributions produce stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, and align better with semantically relevant image regions. A False Visual Grounding counterfactual analysis confirms that the recovered pathways are specific to vision-language interaction.Finally, we perform a structural analysis of hallucinated generations, by extracting graph-based indicators from circuit traces produced by the transcoders. A logistic classifier over these mechanistic graph features predicts hallucinations at AUC $0.68$. These results show that function-centric circuit decomposition yields interpretable and predictive accounts of multimodal computation in VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Transcoders give stronger attributions than SAEs for VLM visual grounding and turn circuit traces into graph features that predict hallucinations at AUC 0.68, but the faithfulness of the proxy is unverified.

read the letter

The paper's main result is that transcoders applied to Gemma 3-4B-IT produce attributions with stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, align better with semantic image regions, and support a logistic classifier on extracted graph features that reaches AUC 0.68 for hallucination prediction. A false visual grounding counterfactual is used to argue the pathways are specific to vision-language interaction. This is the first reported use of transcoders on VLMs for decomposing cross-modal computation in a function-centric way rather than static residual decompositions. The shift to approximating MLP sublayers as causal proxies lets the work link image patches directly to token-generation directions and extract structural indicators from the resulting traces. Those two pieces—the comparative ablation results and the graph-based prediction—are the concrete advances. The paper does a reasonable job motivating why function-centric methods could capture dynamic updates better and showing the pipeline on a real VLM. The soft spots are straightforward. The abstract supplies an AUC and qualitative claims about stability and alignment but gives no error bars, no dataset sizes, no statistical tests, and no ablation details on the transcoder itself. More critically, there is no reported check that reconstruction error stays low on image-patch activations or that the ablation advantages survive when the original MLP is used instead of the transcoder. If approximation artifacts correlate with visual features, the reported edge could be spurious. The stress-test concern therefore stands on the information given. This is aimed at the mechanistic interpretability crowd working on multimodal models. Readers already following SAE and circuit work on VLMs would see the incremental tool and the hallucination angle. It deserves a serious referee to examine the full methods section and any additional controls on faithfulness and experimental rigor.

Referee Report

2 major / 1 minor

Summary. The paper claims that a function-centric framework using Transcoders to approximate MLP sublayers in VLMs like Gemma 3-4B-IT provides interpretable computational pathways for visual grounding. Transcoder attributions are shown to have stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, align better with semantic regions, and a logistic classifier using graph-based features from the traces predicts hallucinations with AUC 0.68. A counterfactual analysis supports the specificity to vision-language interaction.

Significance. Should the results be confirmed with additional controls, this approach could offer a more causal and predictive understanding of how visual inputs influence text generation in VLMs, improving upon representation-based methods like SAEs for tasks such as hallucination detection.

major comments (2)

[Abstract] The interpretation of transcoder attributions as direct drivers of cross-modal token generation depends on the unverified faithfulness of transcoders as causal proxies for the original MLP computation on VLM inputs. The manuscript provides no quantitative check of reconstruction error on image-patch activations or ablation results using the original MLP instead of the transcoder, which is central to the claim of superior stability and alignment.
[Hallucination prediction results] The reported AUC of 0.68 for the logistic classifier over mechanistic graph features is presented without error bars, details on the number of hallucinated vs. non-hallucinated examples, cross-validation procedure, or statistical tests, making it hard to evaluate the robustness of the predictive account.

minor comments (1)

[Notation] Define 'mechanistic graph features' more explicitly in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and will incorporate the suggested additions to strengthen the manuscript's claims regarding transcoder faithfulness and the robustness of the hallucination prediction results.

read point-by-point responses

Referee: [Abstract] The interpretation of transcoder attributions as direct drivers of cross-modal token generation depends on the unverified faithfulness of transcoders as causal proxies for the original MLP computation on VLM inputs. The manuscript provides no quantitative check of reconstruction error on image-patch activations or ablation results using the original MLP instead of the transcoder, which is central to the claim of superior stability and alignment.

Authors: We agree that the current manuscript lacks explicit quantitative faithfulness checks on image-patch activations and direct ablations comparing the transcoder to the original MLP. While the framework draws on established transcoder methodology and the False Visual Grounding counterfactual provides supporting evidence for specificity to vision-language interactions, these additional controls are needed to fully substantiate the causal proxy claim. We will add reconstruction error metrics on image-patch activations and original-MLP ablation comparisons in the revised manuscript. revision: yes
Referee: [Hallucination prediction results] The reported AUC of 0.68 for the logistic classifier over mechanistic graph features is presented without error bars, details on the number of hallucinated vs. non-hallucinated examples, cross-validation procedure, or statistical tests, making it hard to evaluate the robustness of the predictive account.

Authors: We acknowledge that the hallucination prediction results require additional details for proper evaluation. In the revision we will report error bars on the AUC (computed over cross-validation folds or bootstrap samples), the exact counts of hallucinated and non-hallucinated examples, the cross-validation procedure, and results of statistical tests against appropriate baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rely on external benchmarks

full rationale

The paper applies transcoders to Gemma 3-4B-IT and evaluates attribution stability via patch ablation, semantic region alignment, and a False Visual Grounding counterfactual, all external to the fitted transcoder parameters. The AUC 0.68 arises from a logistic classifier trained on graph features extracted from circuit traces to predict an independent target (hallucinations). No equation or claim reduces a reported metric to a quantity defined by the authors' own fitted values or self-citation chain; the derivation remains self-contained against these benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described. The framework implicitly assumes that sparse approximations of MLP sublayers preserve causal structure, but this is not quantified.

pith-pipeline@v0.9.0 · 5757 in / 1053 out tokens · 21470 ms · 2026-05-25T06:18:45.748050+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 11 internal anchors

[1]

Saes are good for steering–if you select the right features

Dana Arad, Aaron Mueller, and Yonatan Belinkov. Saes are good for steering–if you select the right features. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10252–10270,

work page 2025
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Mechanistic Interpretability for AI Safety -- A Review

Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review.arXiv preprint arXiv:2404.14082,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Uncertainty quantification for stationary and time-dependent PDEs subject to Gevrey regular random domain deformations

Steven Bills, Nick Cammarata, Jeff Wu, et al. Revising and falsifying sparse autoencoder feature explanations.arXiv preprint arXiv:2502.12345,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

URL https://transformer-circuits.pub/ 2023/monosemantic-features/index.html. Accessed: 2026-04-26. Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Mon- itoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

M. Cho, D. Kim, et al. Corrsteer: Generation-time llm steering via correlated sae features.arXiv preprint arXiv:2601.09876,

work page arXiv
[7]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Controllable llm reasoning via sparse autoencoder-based steering.arXiv preprint arXiv:2601.03595,

Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, and Fuli Feng. Controllable llm reasoning via sparse autoencoder-based steering.arXiv preprint arXiv:2601.03595,

work page arXiv
[9]

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Gemma 3 Technical Report

URLhttps://arxiv.org/abs/2503.19786. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

ActivationReasoning: Logical Reasoning in Latent Activation Spaces

URLhttps://arxiv.org/abs/2510.18184. Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Ac- cessed: 2026-04-26

URL https://transformer-circuits.pub/2024/crosscoders/index.html. Ac- cessed: 2026-04-26. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916,

work page 2024
[13]

Scott M Lundberg and Su-In Lee

URLhttps://arxiv.org/abs/2502.17514. Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions.Advances in neural information processing systems, 30,

work page arXiv
[14]

When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

URL https://arxiv.org/ abs/2507.13868. Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, and Zeynep Akata. Sparse autoencoders learn monosemantic features in vision-language models,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al

URL https:// arxiv.org/abs/2504.02821. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR,

work page arXiv
[16]

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.71. URL https://aclanthology.org/2021.acl-long.71/. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps.arXiv preprint arXiv:1312.6034,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2021.acl-long.71 2021
[17]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Interpretable steering of large language models with feature guided activation additions.arXiv preprint arXiv:2501.09929,

Samuel Soo, Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, and Yan Ming. Interpretable steering of large language models with feature guided activation additions.arXiv preprint arXiv:2501.09929,

work page arXiv
[19]

Anthropic Interpretability Team

URLhttps://arxiv.org/abs/2502.06755. Anthropic Interpretability Team. Insights on crosscoder model diffing.Anthropic Technical Blog,

work page arXiv
[20]

Circuit tracing in vision-language models: Understanding the internal mechanisms of multimodal thinking.arXiv preprint arXiv:2602.20330,

Jingcheng Yang, Tianhu Xiong, Shengyi Qian, Klara Nahrstedt, and Mingyuan Wu. Circuit tracing in vision-language models: Understanding the internal mechanisms of multimodal thinking.arXiv preprint arXiv:2602.20330,

work page arXiv
[21]

Interpreting clip with hierarchical sparse autoencoders.arXiv preprint arXiv:2502.20578,

Vladimir Zaigrajew, Hubert Baniecki, and Przemyslaw Biecek. Interpreting clip with hierarchical sparse autoencoders.arXiv preprint arXiv:2502.20578,

work page arXiv

[1] [1]

Saes are good for steering–if you select the right features

Dana Arad, Aaron Mueller, and Yonatan Belinkov. Saes are good for steering–if you select the right features. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10252–10270,

work page 2025

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Mechanistic Interpretability for AI Safety -- A Review

Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review.arXiv preprint arXiv:2404.14082,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Uncertainty quantification for stationary and time-dependent PDEs subject to Gevrey regular random domain deformations

Steven Bills, Nick Cammarata, Jeff Wu, et al. Revising and falsifying sparse autoencoder feature explanations.arXiv preprint arXiv:2502.12345,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

URL https://transformer-circuits.pub/ 2023/monosemantic-features/index.html. Accessed: 2026-04-26. Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Mon- itoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

M. Cho, D. Kim, et al. Corrsteer: Generation-time llm steering via correlated sae features.arXiv preprint arXiv:2601.09876,

work page arXiv

[7] [7]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Controllable llm reasoning via sparse autoencoder-based steering.arXiv preprint arXiv:2601.03595,

Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, and Fuli Feng. Controllable llm reasoning via sparse autoencoder-based steering.arXiv preprint arXiv:2601.03595,

work page arXiv

[9] [9]

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Gemma 3 Technical Report

URLhttps://arxiv.org/abs/2503.19786. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

ActivationReasoning: Logical Reasoning in Latent Activation Spaces

URLhttps://arxiv.org/abs/2510.18184. Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Ac- cessed: 2026-04-26

URL https://transformer-circuits.pub/2024/crosscoders/index.html. Ac- cessed: 2026-04-26. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916,

work page 2024

[13] [13]

Scott M Lundberg and Su-In Lee

URLhttps://arxiv.org/abs/2502.17514. Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions.Advances in neural information processing systems, 30,

work page arXiv

[14] [14]

When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

URL https://arxiv.org/ abs/2507.13868. Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, and Zeynep Akata. Sparse autoencoders learn monosemantic features in vision-language models,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al

URL https:// arxiv.org/abs/2504.02821. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR,

work page arXiv

[16] [16]

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.71. URL https://aclanthology.org/2021.acl-long.71/. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps.arXiv preprint arXiv:1312.6034,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2021.acl-long.71 2021

[17] [17]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Interpretable steering of large language models with feature guided activation additions.arXiv preprint arXiv:2501.09929,

Samuel Soo, Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, and Yan Ming. Interpretable steering of large language models with feature guided activation additions.arXiv preprint arXiv:2501.09929,

work page arXiv

[19] [19]

Anthropic Interpretability Team

URLhttps://arxiv.org/abs/2502.06755. Anthropic Interpretability Team. Insights on crosscoder model diffing.Anthropic Technical Blog,

work page arXiv

[20] [20]

Circuit tracing in vision-language models: Understanding the internal mechanisms of multimodal thinking.arXiv preprint arXiv:2602.20330,

Jingcheng Yang, Tianhu Xiong, Shengyi Qian, Klara Nahrstedt, and Mingyuan Wu. Circuit tracing in vision-language models: Understanding the internal mechanisms of multimodal thinking.arXiv preprint arXiv:2602.20330,

work page arXiv

[21] [21]

Interpreting clip with hierarchical sparse autoencoders.arXiv preprint arXiv:2502.20578,

Vladimir Zaigrajew, Hubert Baniecki, and Przemyslaw Biecek. Interpreting clip with hierarchical sparse autoencoders.arXiv preprint arXiv:2502.20578,

work page arXiv