arXiv: 2605.16468 · v1 · pith:2T6CAORGnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI · cs.CL · cs.LG · q-bio.NC

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

Pith reviewed 2026-05-20 19:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL cs.LG q-bio.NC
keywords mechanistic interpretability · neural encoding · visual cortex · voxel selectivity · counterfactual editing · fMRI · feature attribution · brain-AI alignment

The pith

Mechanistic interpretability applied to neural encoding models identifies the specific visual features driving responses in individual human brain voxels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to move beyond correlational predictions by using mechanistic interpretability on language-aligned neural networks to pinpoint which features in natural images causally influence millimeter-scale voxel activity in visual cortex. It produces semantic descriptions of those features for each image and then generalizes them into stable per-voxel selectivity profiles. Validation shows that images synthesized from the extracted features elicit voxel responses matching the originals, while targeted insertion or removal of the same features shifts voxel activation in the predicted direction. Profiles derived across many images produce even larger shifts, and the approach recovers broad category selectivity in known regions while exposing distinct fine-grained preferences unique to each voxel.

Core claim

Mechanistically Interpretable Neural Encoding (MINE) opens black-box encoding models by applying attribution techniques to language-aligned image representations, thereby localizing the visual features inside natural images that drive each voxel's response. These per-image feature descriptions suffice to synthesize new images that produce voxel activations statistically indistinguishable from the originals, outperforming random or low-attribution controls. Counterfactual insertion or deletion of the identified features shifts voxel activation in the expected direction; the same edits guided by the aggregated per-voxel activation profiles produce still larger shifts, confirming that the per-v

What carries the argument

Mechanistically Interpretable Neural Encoding (MINE), which extracts and semantically describes image features via mechanistic interpretability applied to language-aligned neural network representations in order to predict and causally test voxel-level responses.
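
To make the idea of attributing a voxel's predicted response to image features concrete, here is a minimal sketch. It assumes a linear readout from an image embedding to a single voxel and a weight-times-deviation attribution against a baseline embedding. Both are illustrative assumptions: this summary does not specify MINE's encoder or attribution method, and the names `predict_voxel` and `attribute`, along with the toy numbers, are hypothetical.

```python
from typing import List


def predict_voxel(weights: List[float], bias: float, embedding: List[float]) -> float:
    """Linear readout (assumed): predicted voxel response for one image embedding."""
    return bias + sum(w * x for w, x in zip(weights, embedding))


def attribute(weights: List[float], embedding: List[float], baseline: List[float]) -> List[float]:
    """Per-dimension contribution to the prediction, relative to a baseline embedding."""
    return [w * (x - b) for w, x, b in zip(weights, embedding, baseline)]


# Toy values, not taken from the paper.
weights = [0.5, -1.0, 2.0]
bias = 0.1
embedding = [1.0, 0.5, 0.25]
baseline = [0.0, 0.0, 0.0]

contributions = attribute(weights, embedding, baseline)
delta = predict_voxel(weights, bias, embedding) - predict_voxel(weights, bias, baseline)

# For a linear readout the contributions sum to the change in prediction.
print(contributions, delta)
```

For a linear readout the contributions add up exactly to the difference between the prediction for the image and the prediction for the baseline, which is what makes the decomposition easy to check. A nonlinear encoder would need a different attribution technique.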

If this is right

  • Counterfactual editing guided by per-voxel activation profiles yields stronger and more reliable shifts in voxel activation than edits based on single-image features.
  • The same framework recovers the expected categorical preferences of well-studied regions while exposing distinct selectivity patterns unique to each voxel inside those regions.
  • Images synthesized solely from the localized features elicit voxel responses that match the originals more accurately than control images.
  • Generalizing per-image attributions into per-voxel profiles produces faithful summaries of each voxel's functional selectivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Language-aligned models may capture semantic structure that aligns more closely with human visual cortex than purely vision-trained networks.
  • The causal-editing validation step could be adapted to test feature hypotheses in other brain regions or sensory modalities.
  • Fine-grained voxel-level profiles might eventually support more precise brain-computer interface decoding or stimulation.
  • The method supplies a concrete route for generating and testing new hypotheses about visual coding that go beyond existing category-level descriptions.

Load-bearing premise

The attributions extracted from the artificial network correctly mark the image features that actually cause the observed human voxel responses rather than reflecting model-specific artifacts.

What would settle it

Counterfactual edits that insert or remove the predicted features fail to shift voxel activation in the direction the model forecasts, or images generated from those features produce responses no closer to the originals than images generated from random or low-attribution controls.
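
The two failure conditions above can be phrased as simple checks. The sketch below is illustrative only: `direction_agreement` and `error_advantage` are hypothetical helper names, the numbers are toy values, and the paper's own statistics may differ.

```python
from statistics import mean


def direction_agreement(predicted_signs, observed_shifts):
    """Fraction of edits whose activation shift has the predicted sign."""
    hits = sum(1 for s, d in zip(predicted_signs, observed_shifts) if s * d > 0)
    return hits / len(observed_shifts)


def error_advantage(errors_feature, errors_control):
    """Mean control error minus mean feature-based error (positive favors the features)."""
    return mean(errors_control) - mean(errors_feature)


# Toy data: +1 means an insertion expected to raise activation, -1 a removal.
predicted_signs = [+1, +1, -1, -1, +1]
observed_shifts = [0.4, 0.1, -0.3, 0.2, 0.5]
agreement = direction_agreement(predicted_signs, observed_shifts)

# Toy errors between responses to generated images and to the originals.
errors_feature = [0.2, 0.3, 0.1]
errors_control = [0.5, 0.6, 0.4]
advantage = error_advantage(errors_feature, errors_control)

print(agreement, advantage)
```

Agreement near chance (0.5) or an advantage near zero would correspond to the outcomes described above as settling the question against the method.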

Figures

Figures reproduced from arXiv: 2605.16468 by Galit Yovel, Idan Daniel Grosbard, Mor Geva.

Figure 1: Overview of the MINE framework. (a) A neural encoder is trained on textually aligned …
Figure 2: Predicted activations for stimuli generated according to critical features. (a) Error distribu…
Figure 3: Predicted activations for images counterfactually edited according to critical features. (a) …
Figure 4: Evaluation of voxel profiles via counterfactual editing. (a–d) Examples of original non…
Figure 5: Per-voxel model performance comparison. (a) Distribution of per-voxel …
Figure 6: Per-voxel comparison of explained variance ( …
Figure 7: Results for the mean-patching analysis, when targeting …
Figure 8: Prompt for image level analysis. Extracted prompt text: "You are analyzing the role of a classification neuron in a vision model. Below are 10 descriptions of critical visual features from different correctly classified images: . . . Based on these descriptions, identify the shared underlying visual information that this neuron appears to encode. What specific features, patterns, or concepts does this neuron detect? Provide a 1-s…"
Figure 9: Prompt for neuron level analysis. Adjacent extracted text (Section D.4, Voxel profile generation): For each voxel, we collected all descriptions from the counterfactual image-editing pipeline (see Section 4.5) whose faithfulness score lay in the upper quartile of that voxel's faithfulness distribution and required a minimum of three surviving trials per voxel. This yielded 225 voxel-subject profiles, each supported by a mean of 175 trials (r…
Figure 10: Decoded descriptions for eight CIFAR-10 classes.
Figure 13: Prompt for voxel profile generation.
Figure 14: Prompt used to identify shared and unique voxel profile content per ROI.
Figure 15: EBA voxel profiles. Shared profile: Dynamic human figures engaged in athletic activities and sports (particularly water sports like surfing and winter sports like skiing), wearing athletic or specialized gear, often in motion or action contexts. Secondary consistent preference for animals in natural or outdoor settings.
Figure 16: FBA-1 voxel profiles. Shared profile: Strong preference for human subjects in distinctive contexts (professional attire, sports activities, formal settings) and animals with visible anatomical or facial details. Consistent sensitivity to portraits, close-up facial features, and subjects engaged in identifiable roles or actions.
Figure 17: FBA-2 voxel profiles. Shared profile: Strong responsiveness to human and animal subjects in dynamic or formal contexts, with consistent sensitivity to facial features, distinctive clothing/attire, and outdoor or athletic settings. Activation across diverse subjects including people engaged in sports (skiing, surfing, tennis), wildlife (elephants, polar bears, cattle), and individuals in formal or winter w…
Figure 18: FFA-1 voxel profiles (1/2). Shared profile: Human faces and upper bodies in professional, formal, or distinctive contexts (business attire, uniforms, sports wear); animals with visible anatomical and facial details; dynamic physical activities and action scenes; sensitivity to clothing, facial features, and compositional structure.
Figure 18 (continued): FFA-1 voxel profiles. (Extracted voxel labels: voxel_20662, voxel_20732, voxel_20737, voxel_20794, voxel_20795)
Figure 19: FFA-2 voxel profiles. Shared profile: Strong responsiveness to human figures and animals in dynamic or characteristic contexts, with sensitivity to distinctive visual features including clothing, body positioning, and facial/anatomical details. Consistent activation for active subjects (athletes, people in motion, wildlife) and outdoor or natural settings. (Extracted voxel labels: voxel_2271, voxel_2528, voxel_2529, voxel_3094, voxel…)
Figure 20: OFA voxel profiles. Shared profile: Dynamic human figures engaged in physical activities, sports, and movement across diverse contexts (skiing, baseball, skateboarding, tennis, surfing), combined with sensitivity to athletic/specialized gear and apparel, as well as animals in motion and outdoor/active scenarios.
Figure 21: OPA voxel profiles. Shared profile: Strong preference for structured, functional indoor spaces (particularly kitchens and bathrooms with fixtures), organized domestic interiors with clear architectural elements, and human-made environments with purposeful design and infrastructure. (Extracted voxel labels: voxel_8629, voxel_8734, voxel_8851, voxel_8862, voxel_8869)
Figure 22: PPA voxel profiles. Shared profile: Preference for structured, organized environments with clear spatial hierarchies and functional purposes, including both indoor domestic/functional spaces (kitchens, bathrooms, offices) and outdoor scenes with defined architectural or infrastructural elements. (Extracted voxel labels: voxel_7650, voxel_7651, voxel_7835, voxel_8015, voxel_18720)
Figure 23: RSC voxel profiles. Shared profile: Structured environments with organized spatial elements, human figures in various contexts, and architectural or infrastructural frameworks. All voxels show sensitivity to both indoor functional spaces and outdoor scenes with clear organizational or structural components.
Figure 24: VWFA-1 voxel profiles. Shared profile: Strong preference for human subjects across professional, formal, and athletic contexts. Consistent sensitivity to facial features, clothing details (particularly formal wear like suits and uniforms), and human upper bodies/portraits. Secondary responsiveness to animals with prominent facial features. (Extracted voxel labels: voxel_8912, voxel_8915, voxel_9023, voxel_16964, voxel_16974)
Figure 25: VWFA-2 voxel profiles. Shared profile: Dynamic action and athletic activities with people in motion, wearing sport-specific attire and protective gear, engaged in physical movement and sports contexts.
Figure 26: aTL-faces voxel profiles. Shared profile: All voxels show strong responsiveness to human subjects with attention to facial features, distinctive clothing details, and contextual settings. Secondary responsiveness to animals (particularly mammals like dogs) with emphasis on facial features and distinctive visual characteristics is present across voxels. Sensitivity to both portrait/close-up compositions an…
Figure 27: mTL-words voxel profiles. Shared profile: All voxels show strong responsiveness to human subjects and animals with emphasis on facial features, distinctive visual characteristics, and contextual details. There is consistent sensitivity to both people and animals across various compositions and attire types.
Figure 28: mfs-words voxel profiles. Shared profile: Close-up facial imagery and anatomical details of both humans and animals, with sensitivity to portraits, faces with distinct features, and animals depicted with visible body parts and physical characteristics.
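
The trial-selection rule quoted in the Figure 9 entry (keep descriptions whose faithfulness score lies in the upper quartile for that voxel, and require at least three surviving trials) can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `select_profile_trials`, the choice of quartile convention (the default of Python's `statistics.quantiles`), and the use of `>=` at the threshold are all assumptions.

```python
from statistics import quantiles


def select_profile_trials(trials, min_trials=3):
    """Keep descriptions in the upper quartile of faithfulness for one voxel.

    trials: list of (description, faithfulness_score) pairs.
    Returns the kept descriptions, or None if fewer than min_trials survive.
    """
    scores = [score for _, score in trials]
    if len(scores) < 2:  # too few points to estimate a quartile
        return None
    q3 = quantiles(scores, n=4)[2]  # third quartile (assumed convention)
    kept = [desc for desc, score in trials if score >= q3]
    return kept if len(kept) >= min_trials else None


# Toy voxel with 12 trials and scores 1.0 .. 12.0 (hypothetical values).
many_trials = [(f"d{i}", float(i)) for i in range(1, 13)]
kept_many = select_profile_trials(many_trials)

# Toy voxel with only 4 trials: too few survive the filter.
few_trials = [(f"d{i}", float(i)) for i in range(1, 5)]
kept_few = select_profile_trials(few_trials)

print(kept_many, kept_few)
```

A different quartile convention or a strict inequality would change which borderline trials survive, so the counts reported in the paper should not be expected to match this sketch exactly.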
Abstract

A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel's response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel's response using language-aligned image representations, and produces semantically interpretable descriptions of the features critical for the voxel's activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show they are sufficient to generate images that elicit voxel responses matching the responses to the original images, more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel's selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing it recovers their known categorical preferences while revealing fine-grained unique voxel structure within each region. Overall, our results establish mechanistic interpretability as a path to discover and causally validate fine-grained hypotheses about neural function.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Mechanistically Interpretable Neural Encoding (MINE), a framework that applies mechanistic-interpretability tools to language-aligned image representations from neural networks to predict voxel-level responses in human visual cortex and extract semantically interpretable feature descriptions. It validates these attributions by showing that images generated from the descriptions elicit matching voxel responses (superior to random or low-attribution controls) and that counterfactual insertion or removal of the predicted features shifts activations in the expected direction. Per-voxel functional profiles derived from these attributions produce even stronger shifts, and the method is applied to category-selective regions to recover known preferences while revealing fine-grained voxel structure.

Significance. If the causal claims survive rigorous controls for editing confounds, the work would meaningfully advance the integration of mechanistic interpretability with systems neuroscience by moving encoding models from correlational to causally testable accounts of fine-grained functional selectivity. The combination of per-image attributions, generative validation, and profile-guided counterfactuals provides a concrete path for hypothesis generation and testing that could generalize beyond visual cortex.

major comments (2)
  1. [Validation / Counterfactual experiments] The central causal interpretation rests on the counterfactual editing results (described in the abstract and validation sections). The manuscript does not report quantitative controls demonstrating that edits alter only the target attributed features while leaving low- and mid-level statistics (contrast, texture, spatial frequency, object co-occurrence) unchanged; without such metrics or matched non-semantic edit controls, directionally correct voxel shifts could arise from un-attributed confounds rather than the MINE attributions themselves.
  2. [Results on per-voxel profiles] The per-voxel activation profile results inherit the same vulnerability: stronger shifts are reported, yet the editing procedure used to test them is not shown to isolate the profile-derived features. A load-bearing test would be to compare against edits that preserve the profile but scramble the specific attribution map, or to quantify residual changes in non-profile features.
minor comments (2)
  1. [Abstract] The abstract refers to 'language-aligned image representations' without naming the specific model, alignment objective, or layer used; the main text should specify these, with a reference to the exact architecture.
  2. [Abstract and Results] Quantitative effect sizes (e.g., mean activation change in standard deviations, statistical tests against controls) are not summarized for the generation or counterfactual experiments; adding these would improve readability.
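
One way to summarize the effect sizes requested in the second minor comment is a standardized mean shift with a sign-flip permutation test. The sketch below assumes paired per-trial shifts (edited minus original activation) and uses toy numbers; the function names are hypothetical and the paper may report different statistics.

```python
import random
from statistics import mean, stdev


def standardized_shift(shifts):
    """Mean paired shift in units of its own standard deviation (a paired Cohen's d)."""
    return mean(shifts) / stdev(shifts)


def sign_flip_p_value(shifts, n_permutations=5000, seed=0):
    """One-sided sign-flip permutation test for mean shift > 0."""
    rng = random.Random(seed)
    observed = mean(shifts)
    count = 0
    for _ in range(n_permutations):
        # Under the null, each paired difference is equally likely to be + or -.
        flipped = [s if rng.random() < 0.5 else -s for s in shifts]
        if mean(flipped) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_permutations + 1)


# Toy shifts for feature-insertion edits (hypothetical values).
shifts = [0.5, 0.7, 0.4, 0.6, 0.8, 0.3, 0.5, 0.6]
d = standardized_shift(shifts)
p = sign_flip_p_value(shifts)
print(d, p)
```

Running the same computation on the random or low-attribution control edits would give the comparison against controls that the comment asks for.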

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which identify key opportunities to strengthen the causal interpretations in our validation experiments. We respond to each major comment below and will incorporate the suggested controls into the revised manuscript.

Point-by-point responses
  1. Referee: [Validation / Counterfactual experiments] The central causal interpretation rests on the counterfactual editing results (described in the abstract and validation sections). The manuscript does not report quantitative controls demonstrating that edits alter only the target attributed features while leaving low- and mid-level statistics (contrast, texture, spatial frequency, object co-occurrence) unchanged; without such metrics or matched non-semantic edit controls, directionally correct voxel shifts could arise from un-attributed confounds rather than the MINE attributions themselves.

    Authors: We agree that additional quantitative controls are necessary to rule out low- and mid-level confounds. Although our existing random and low-attribution controls provide initial evidence of specificity, they do not directly quantify preservation of low-level statistics. In the revised manuscript we will add explicit metrics comparing contrast, texture, spatial frequency, and object co-occurrence between original and edited images. We will also introduce matched non-semantic edit controls (e.g., low-level perturbations that preserve semantic content) to demonstrate that activation shifts are driven by the attributed features rather than incidental image statistics. revision: yes

  2. Referee: [Results on per-voxel profiles] The per-voxel activation profile results inherit the same vulnerability: stronger shifts are reported, yet the editing procedure used to test them is not shown to isolate the profile-derived features. A load-bearing test would be to compare against edits that preserve the profile but scramble the specific attribution map, or to quantify residual changes in non-profile features.

    Authors: This is a fair and important point for validating the per-voxel profiles. We will add the requested controls in the revision: (1) edits that preserve the overall profile statistics but scramble the specific per-image attribution maps, and (2) quantification of residual changes in non-profile features. These analyses will directly test whether the stronger activation shifts arise from the profile-guided features rather than from unaccounted confounds. revision: yes
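
The low-level controls discussed in these responses could start from a few simple image statistics compared before and after an edit. The sketch below is a minimal illustration on grayscale images stored as nested lists. The statistic names, the `low_level_drift` helper, and the neighbor-difference proxy for high-spatial-frequency content are assumptions, not the metrics the authors commit to.

```python
from math import sqrt


def mean_luminance(img):
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


def rms_contrast(img):
    """Root-mean-square deviation of pixel values from their mean."""
    pixels = [p for row in img for p in row]
    mu = sum(pixels) / len(pixels)
    return sqrt(sum((p - mu) ** 2 for p in pixels) / len(pixels))


def edge_energy(img):
    """Mean absolute difference between adjacent pixels (crude high-frequency proxy)."""
    total, count = 0.0, 0
    for r, row in enumerate(img):
        for c, p in enumerate(row):
            if c + 1 < len(row):  # horizontal neighbor
                total += abs(row[c + 1] - p)
                count += 1
            if r + 1 < len(img):  # vertical neighbor
                total += abs(img[r + 1][c] - p)
                count += 1
    return total / count


STATS = {
    "mean_luminance": mean_luminance,
    "rms_contrast": rms_contrast,
    "edge_energy": edge_energy,
}


def low_level_drift(original, edited):
    """Change in each statistic caused by the edit (edited minus original)."""
    return {name: fn(edited) - fn(original) for name, fn in STATS.items()}


# Toy 2x2 images: a checkerboard and a uniform gray patch with the same mean.
original = [[0.0, 1.0], [1.0, 0.0]]
flat = [[0.5, 0.5], [0.5, 0.5]]
drift = low_level_drift(original, flat)
print(drift)
```

An edit that is meant to change only semantic content would ideally show drift close to zero on statistics like these. Large drift would point to the kind of confound the referee describes.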

Circularity Check

0 steps flagged

No circularity: validations depend on independent external brain measurements

full rationale

The paper introduces MINE to predict voxel responses from language-aligned representations and derives per-image feature descriptions plus per-voxel profiles. Validation proceeds by generating images from those descriptions and measuring actual fMRI voxel responses, plus performing counterfactual feature insertion/removal on images and observing directional shifts in measured human brain activations. These steps rely on external empirical data collected independently of the model's internal attributions or any fitted parameters within the present work. No equations or claims reduce a prediction to a fitted input by construction, no self-citation chain bears the central load, and no ansatz or uniqueness result is imported from prior author work to force the outcome. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters or axioms; the approach implicitly relies on standard assumptions of fMRI preprocessing and the fidelity of the chosen neural network as a model of visual cortex.

pith-pipeline@v0.9.0 · 5837 in / 1101 out tokens · 41207 ms · 2026-05-20T19:31:32.784655+00:00 · methodology


Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 3 internal anchors

  1. [1]

    Marr and T

    D. Marr and T. Poggio. From Understanding Computation to Understanding Neural Circuitry. May 1976

  2. [2]

    D. Marr. Visual information processing: The structure and creation of visual representations. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 290(1038): 199–218, July 1980. ISSN 0080-4622. doi: 10.1098/rstb.1980.0091

  3. [3]

    The fusiform face area: A module in human extrastriate cortex specialized for face perception.Journal of neuroscience, 17(11): 4302–4311, 1997

    Nancy Kanwisher, Josh McDermott, and Marvin M Chun. The fusiform face area: A module in human extrastriate cortex specialized for face perception.Journal of neuroscience, 17(11): 4302–4311, 1997

  4. [4]

    A cortical representation of the local visual environment

    Russell Epstein and Nancy Kanwisher. A cortical representation of the local visual environment. Nature, 392(6676):598–601, 1998. ISSN 1476-4687. doi: 10.1038/33402

  5. [5]

    Downing, Yuhong Jiang, Miles Shuman, and Nancy Kanwisher

    Paul E. Downing, Yuhong Jiang, Miles Shuman, and Nancy Kanwisher. A Cortical Area Selective for Visual Processing of the Human Body.Science, 293(5539):2470–2473, September

  6. [6]
  7. [7]

    LaVCa: LLM-assisted Visual Cortex Captioning.arXiv preprint arXiv:2502.13606, 2025

    Takuya Matsuyama, Shinji Nishimoto, and Yu Takagi. LaVCa: LLM-assisted Visual Cortex Captioning.arXiv preprint arXiv:2502.13606, 2025

  8. [8]

    In Silico Mapping of Visual Categorical Selectivity Across the Whole Brain, October 2025

    Ethan Hwang, Hossein Adeli, Wenxuan Guo, Andrew Luo, and Nikolaus Kriegeskorte. In Silico Mapping of Visual Categorical Selectivity Across the Whole Brain, October 2025

  9. [9]

    BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain, December 2025

    Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, and Michal Irani. BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain, December 2025

  10. [10]

    Brainscuba: Fine- grained natural language captions of visual cortex selectivity.arXiv preprint arXiv:2310.04420, 2023

    Andrew F Luo, Margaret M Henderson, Michael J Tarr, and Leila Wehbe. Brainscuba: Fine- grained natural language captions of visual cortex selectivity.arXiv preprint arXiv:2310.04420, 2023

  11. [11]

    CLIP-MSM: A Multi-Semantic Mapping Brain Representation for Human High-Level Visual Cortex

    Guoyuan Yang, Mufan Xue, Ziming Mao, Haofang Zheng, Jia Xu, Dabin Sheng, Ruotian Sun, Ruoqi Yang, and Xuesong Li. CLIP-MSM: A Multi-Semantic Mapping Brain Representation for Human High-Level Visual Cortex. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9184–9192, 2025

  12. [12]

    Brain diffusion for visual exploration: Cortical discovery using large scale generative models.Advances in Neural Information Processing Systems, 36:75740–75781, 2023

    Andrew Luo, Maggie Henderson, Leila Wehbe, and Michael Tarr. Brain diffusion for visual exploration: Cortical discovery using large scale generative models.Advances in Neural Information Processing Systems, 36:75740–75781, 2023. 10

  13. [13]

    Tan Gao, Mufan Xue, Haofang Zheng, Shuo Lv, Jia Xu, Dabin Sheng, Ziming Mao, Xinyu Wu, Andrew Luo, and Guoyuan Yang. BrainLMM: A Label-Free Framework for Mapping Multi- Semantic Representation in the Human Visual Cortex.Proceedings of the AAAI Conference on Artificial Intelligence, 40(6):4176–4184, March 2026. ISSN 2374-3468. doi: 10.1609/aaai. v40i6.42413

  14. [14]

    Luo, Jacob Yeung, Rushikesh Zawar, Shaurya Dewan, Margaret M

    Andrew F. Luo, Jacob Yeung, Rushikesh Zawar, Shaurya Dewan, Margaret M. Henderson, Leila Wehbe, and Michael J. Tarr. Brain Mapping with Dense Features: Grounding Cortical Semantic Selectivity in Natural Images With Vision Transformers, June 2025

  15. [15]

    Transformer brain encoders explain human high-level visual responses, May 2025

    Hossein Adeli, Minni Sun, and Nikolaus Kriegeskorte. Transformer brain encoders explain human high-level visual responses, May 2025

  16. [16]

    Towards Interpretable Visual Decoding with Attention to Brain Representa- tions, September 2025

    Pinyuan Feng, Hossein Adeli, Wenxuan Guo, Fan Cheng, Ethan Hwang, and Nikolaus Kriegeskorte. Towards Interpretable Visual Decoding with Attention to Brain Representa- tions, September 2025

  17. [17]

    An Interpretability Illusion for BERT, April 2021

    Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, and Martin Wattenberg. An Interpretability Illusion for BERT, April 2021

  18. [18]

    Causal Analysis for Robust Interpretability of Neural Networks

    Ola Ahmad, Nicolas Béreux, Loïc Baret, Vahid Hashemi, and Freddy Lecue. Causal Analysis for Robust Interpretability of Neural Networks. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4685–4694, 2024

  19. [19]

    Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, and Tom McGrath

    Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, D...

  20. [20]

    The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms, August 2024

    Adam Davies and Ashkan Khakzar. The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms, August 2024

  21. [21]

    Interpreting GPT: The logit lens — LessWrong

    nostalgebraist. Interpreting GPT: The logit lens — LessWrong. August 2020

  22. [22]

    Prince, John A

    Muquan Yu, Mu Nan, Hossein Adeli, Jacob S. Prince, John A. Pyles, Leila Wehbe, Margaret M. Henderson, Michael J. Tarr, and Andrew F. Luo. Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex, November 2025

  23. [23]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PmLR, 2021

  24. [24]

    Wang, Kendrick Kay, Thomas Naselaris, Michael J

    Aria Y . Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, and Leila Wehbe. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset.Nature Machine Intelligence 2023 5:12, 5(12):1415–1426, November

  25. [25]
  26. [26]

    arXiv preprint arXiv:2204.10965 , year=

    Tuomas Oikarinen and Tsui-Wei Weng. Clip-dissect: Automatic description of neuron repre- sentations in deep vision networks.arXiv preprint arXiv:2204.10965, 2022

  27. [27]

    Ron Mokady, Amir Hertz, and Amit H. Bermano. ClipCap: CLIP Prefix for Image Captioning. November 2021

  28. [28]

    Emogen: Emotional image content generation with text-to-image diffusion models,

    Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26690–26699, Seattle, W A, USA, June 2024. IEEE. ISBN 979-8-3503-5300-6. doi...

  29. [29]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. Advances in Neural Information Processing Systems, 36, April 2023. ISSN 10495258. 11

  30. [30]

    Improved Baselines with Visual Instruction Tuning, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning, 2024

  31. [31]

    Towards Interpreting Visual Information Processing in Vision-Language Models

    Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards Interpreting Visual Information Processing in Vision-Language Models. October 2024

  32. [32]

    Allen, Ghislain St-Yves, Yihan Wu, Jesse L

    Emily J. Allen, Ghislain St-Yves, Yihan Wu, Jesse L. Breedlove, Jacob S. Prince, Logan T. Dow- dle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, J. Benjamin Hutchinson, Thomas Naselaris, and Kendrick Kay. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence.Nature Neuroscience, 25(1):116–126, January 2022. ISSN...

  33. [33]

    Masset, R

    Guy Gaziv, Roman Beliy, Niv Granot, Assaf Hoogi, Francesca Strappini, Tal Golan, and Michal Irani. Self-supervised Natural Image Reconstruction and Large-scale Semantic Classification from Brain Activity.NeuroImage, 254:119121, July 2022. ISSN 1053-8119. doi: 10.1016/J. NEUROIMAGE.2022.119121

  34. [34]

    More than meets the eye: Self-supervised depth reconstruction from brain activity

    Guy Gaziv and Michal Irani. More than meets the eye: Self-supervised depth reconstruction from brain activity. arXiv preprint arXiv:2106.05113, 2021

  35. [35]

    Sarthak Jain and Byron C. Wallace. Attention is not Explanation, May 2019

  36. [36]

    Locating and Editing Factual Associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems, 35, 2022. ISSN 10495258

  37. [37]

    Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, pages 30–45, March 2022. doi: 10.18653/v1/2022.emnlp-main.3

  38. [38]

    Analyzing Transformers in Embedding Space

    Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing Transformers in Embedding Space, 2023

  39. [39]

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting Recall of Factual Associations in Auto-Regressive Language Models. EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings, pages 12216–12235, April 2023. doi: 10.18653/v1/2023.emnlp-main.751

  40. [40]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information Processing Systems, 2017-December:5999–6009, June 2017. ISSN 10495258. doi: 10.48550/arxiv.1706.03762

  41. [41]

    Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

    Yuval Ran-Milo. Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks, March 2026

  42. [42]

    Why do LLMs attend to the first token?

    Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do LLMs attend to the first token?, August 2025

  43. [43]

    Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

    Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, and Song Mei. Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs, November 2024

  44. [44]

    The Linear Representation Hypothesis and the Geometry of Large Language Models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The Linear Representation Hypothesis and the Geometry of Large Language Models, July 2024

  45. [45]

    Enhancing Automated Interpretability with Output-Centric Feature Descriptions

    Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, and Mor Geva. Enhancing Automated Interpretability with Output-Centric Feature Descriptions, May 2025

  46. [46]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.

  47. [47]

    Probabilistic Maps of Visual Topography in Human Cortex

    Liang Wang, Ryan E.B. Mruczek, Michael J. Arcaro, and Sabine Kastner. Probabilistic Maps of Visual Topography in Human Cortex. Cerebral Cortex, 25(10):3911–3931, October 2015. ISSN 1047-3211. doi: 10.1093/CERCOR/BHU277

  48. [48]

    The Wisdom of a Crowd of Brains: A Universal Brain Encoder

    Roman Beliy, Navve Wasserman, Amit Zalcher, and Michal Irani. The Wisdom of a Crowd of Brains: A Universal Brain Encoder, March 2025

  49. [49]

    Axiomatic Attribution for Deep Networks

    Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3319–3328. PMLR, July 2017

  50. [50]

    Introducing Claude Haiku 4.5

    Anthropic. Introducing Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, October 2025

  51. [51]

    Explaining black box text modules in natural language with language models

    Chandan Singh, Aliyah R. Hsu, Richard Antonello, Shailee Jain, Alexander G. Huth, Bin Yu, and Jianfeng Gao. Explaining black box text modules in natural language with language models, November 2023

  52. [52]

    Black-forest-labs/flux2

    Black-forest-labs/flux2. black-forest-labs, March 2026

  53. [53]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow Matching for In-Context Image ...

  54. [54]

    Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms

    Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov. Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms, July 2024

  55. [55]

    Specific body image pathology in acute schizophrenia

    S. Priebe and F. Röhricht. Specific body image pathology in acute schizophrenia. Psychiatry Research, 101(3), 2001. ISSN 01651781. doi: 10.1016/S0165-1781(01)00214-1

  56. [56]

    A large-scale examination of inductive biases shaping high-level visual representation in brains and machines

    Colin Conwell, Jacob S. Prince, Kendrick N. Kay, George A. Alvarez, and Talia Konkle. A large-scale examination of inductive biases shaping high-level visual representation in brains and machines. Nature Communications, 15(1):1–18, October 2024. ISSN 2041-1723. doi: 10.1038/s41467-024-53147-y. URL https://www.nature.com/articles/s41467-024-53147-y

  57. [57]

    Temporal Processing Capacity in High-Level Visual Cortex Is Domain Specific

    Anthony Stigliani, Kevin S. Weiner, and Kalanit Grill-Spector. Temporal Processing Capacity in High-Level Visual Cortex Is Domain Specific. Journal of Neuroscience, 35(36):12412–12424, September 2015. ISSN 0270-6474, 1529-2401. doi: 10.1523/JNEUROSCI.4822-14.2015

  58. [58]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2016-December, 2016. doi: 10.1109/CVPR.2016.90

  59. [59]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

  60. [60]

    On Layer Normalization in the Transformer Architecture

    Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On Layer Normalization in the Transformer Architecture, June 2020

  61. [61]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. 7th International Conference on Learning Representations, ICLR 2019, November 2017

  62. [62]

    Super-convergence: Very fast training of neural networks using large learning rates

    Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, pages 369–386. SPIE, 2019

  63. [63]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performa...

  64. [64]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick ...

  65. [65]

    Vision-Language Models Align with Human Neural Representations in Concept Processing

    Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, and Raquel Fernández. Vision-Language Models Align with Human Neural Representations in Concept Processing, January 2026

  66. [66]

    High-level visual cortex representations are organized along visual rather than abstract principles

    Adva Shoham, Rotem Broday-Dvir, Rafael Malach, and Galit Yovel. High-level visual cortex representations are organized along visual rather than abstract principles, April 2025

  67. [67]

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, ...

  68. [68]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

  69. [69]

    Diffusers: State-of-the-art diffusion models

    Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models, 2022

  70. [70]

    Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP

    Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024, 2022

  71. [71]

    DSPy: Compiling declarative language model calls into self-improving pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. In The Twelfth International Conference on Learning Representations, 2024

  72. [72]

    J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55

  73. [73]

    Collaborative data science

    Plotly Technologies Inc. Collaborative data science. https://plot.ly, 2015

  74. [74]

    ImageNet Large Scale Visual Recognition Challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. Inte...

  75. [75]

    Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

    Fred Zhang and Neel Nanda. Towards Best Practices of Activation Patching in Language Models: Metrics and Methods, January 2024

  76. [76]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

  77. [77]

    Claude Sonnet 4.5

    Anthropic. Claude Sonnet 4.5. Large language model, 2025

  78. [78]

    The Underpinnings of the BOLD Functional Magnetic Resonance Imaging Signal

    Nikos K. Logothetis. The Underpinnings of the BOLD Functional Magnetic Resonance Imaging Signal. Journal of Neuroscience, 23(10):3963–3971, May 2003. ISSN 0270-6474, 1529-2401. doi: 10.1523/JNEUROSCI.23-10-03963.2003.

  79. [79]

    Biophysical and Physiological Origins of Blood Oxygenation Level-Dependent fMRI Signals

    Seong-Gi Kim and Seiji Ogawa. Biophysical and Physiological Origins of Blood Oxygenation Level-Dependent fMRI Signals. Journal of Cerebral Blood Flow & Metabolism, 32(7):1188–1206, July 2012. ISSN 0271-678X. doi: 10.1038/jcbfm.2012.23

  80. [80]

    Interpreting the BOLD Signal

    Nikos K. Logothetis and Brian A. Wandell. Interpreting the BOLD Signal. Annual Review of Physiology, 66(Volume 66, 2004):735–769, February 2004. ISSN 0066-4278, 1545-1585. doi: 10.1146/annurev.physiol.66.082602.092845

Showing first 80 references.