arxiv: 2606.06857 · v1 · pith:7E7RF4ZRnew · submitted 2026-06-05 · 💻 cs.CL

Interpreting Brain Responses to Language with Sparse Features from Language Models

Pith reviewed 2026-06-27 22:15 UTC · model grok-4.3

classification 💻 cs.CL
keywords: sparse autoencoders · language models · fMRI · brain encoding · surprisal · language network · neural interpretation

The pith

Brain responses during language processing are best explained by the most general features encoded in language model representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Augmented Sparse Encoding Models, which replace dense language model hidden states with hierarchically organized sparse autoencoder features and add surprisal as a predictor. This framework recovers known voxel tunings to processing difficulty and meaning abstractness, identifies a new population tuned to people-related content, and shows that the fronto-temporal language network shares a common feature set while frontal regions are well explained by surprisal alone. The central result is that brain responses align preferentially with features capturing the most general information in LM representations rather than arbitrary or idiosyncratic ones.

Core claim

Augmented Sparse Encoding Models that substitute hierarchically-organized sparse autoencoder features from language models for dense hidden states, while including surprisal, recover prior voxel interpretations and identify a people-related population. The fronto-temporal human language network is predicted by a common set of these features across regions, with frontal areas relatively well explained by surprisal alone. Brain responses are best explained by the features that capture the most general information encoded in LM representations, indicating a nontrivial correspondence between brain and LM language representation.

What carries the argument

Augmented Sparse Encoding Models, which replace dense LM hidden states with hierarchically-organized sparse autoencoder features and add surprisal as an explicit predictor to distinguish primary from idiosyncratic variation in LM representations.
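
One way to make the contrast between the two predictor sets explicit is sketched below. The notation is an assumption introduced here for illustration, and the paper's own symbols may differ: $h(s)$ is the dense hidden state for sentence $s$, $z(s)$ the vector of SAE feature activations, $u(s)$ the surprisal, and $y_v(s)$ the response of voxel $v$. The L1 penalty reflects the LASSO feature selection mentioned in the figure captions; leaving the surprisal weight unpenalized is a guess, not something the source states.

```latex
\begin{align}
\text{dense encoding model:} \quad & \hat{y}_v(s) = w_v^\top h(s) + b_v \\
\text{augmented sparse encoding model:} \quad & \hat{y}_v(s) = \beta_v^\top z(s) + \gamma_v\, u(s) + b_v \\
\text{LASSO fit:} \quad & \min_{\beta_v,\, \gamma_v,\, b_v} \sum_{s} \bigl(y_v(s) - \hat{y}_v(s)\bigr)^2 + \lambda \lVert \beta_v \rVert_1
\end{align}
```

Under this reading, a "surprisal alone" model corresponds to fixing $\beta_v = 0$, which is the comparison behind the statement that frontal regions are relatively well explained without LM features.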

If this is right

  • Voxel populations can be interpreted as tuned to processing difficulty, meaning abstractness, and people-related content.
  • A common set of features predicts responses across the fronto-temporal language network.
  • Frontal regions are relatively well explained by surprisal alone even without LM features.
  • Brain responses are not predictable from an arbitrary set of LM features but align with those capturing general information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be extended to test whether the same general features predict responses across different language models or tasks.
  • It raises the possibility that brains selectively represent shared linguistic structure while filtering model-specific details.
  • Sparse features might enable more precise mapping of individual differences in language cortex by focusing on interpretable dimensions.

Load-bearing premise

The sparse autoencoder features extracted from one language model and training regime are stable enough to identify which features capture the most general information across language models.

What would settle it

Brain responses being predicted equally well or better by a random selection of sparse autoencoder features rather than the subset identified as capturing the most general LM information.
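
The check described above can be sketched as a comparison against random feature subsets. This is a minimal illustration, not the paper's procedure: `draw_random_subsets` and `share_matching_or_beating` are hypothetical helpers, the scores are made-up placeholders, and the subset size of 18 and dictionary size of 32768 are borrowed from the appendix fragments further down only to make the example concrete.

```python
import random


def draw_random_subsets(n_features, subset_size, n_draws, seed=0):
    """Draw n_draws random feature-index subsets, each of size subset_size."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size)) for _ in range(n_draws)]


def share_matching_or_beating(general_score, random_scores):
    """Fraction of random subsets whose predictivity is >= the general subset's."""
    if not random_scores:
        raise ValueError("need at least one random-subset score")
    hits = sum(1 for score in random_scores if score >= general_score)
    return hits / len(random_scores)


# Hypothetical numbers for illustration only; they are not results from the paper.
subsets = draw_random_subsets(n_features=32768, subset_size=18, n_draws=5)
general_score = 0.42
random_scores = [0.31, 0.35, 0.44, 0.29, 0.38]  # one held-out score per random subset
share = share_matching_or_beating(general_score, random_scores)
print(f"{share:.2f} of random subsets match or beat the general subset")
```

In practice each random subset would be fed through the same encoding pipeline as the general subset to obtain its held-out score; a share close to or above one half would count against the paper's central claim.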

Figures

Figures reproduced from arXiv: 2606.06857 by Greta Tuckute, Kendrick Kay, Michael A. Lepori.

Figure 1: Overview of methods. (A) Each dot is a voxel from one sample participant, projected onto two PCs from prior work. Voxels selected for analysis in Study 1 and 2 are annotated. Color denotes each voxel’s correlation with sentence-frequency annotations derived from humans (Tuckute et al., 2025). (B) Visualization of LM feature spaces. Prior LM encoding models rely on residual stream features, whereas Augmente…
Figure 2: (A) Normalized encoding model predictivity of four voxel subtypes using different feature spaces. Processing-difficulty voxels show no predictivity benefit from LM representation features beyond surprisal, whereas abstract and concrete voxels benefit substantially. (B) Matryoshka SAE features selected by regressions on Abstract and Concrete voxels. Features form different subspaces for each subtype general…
Figure 3: (A) Generalization heatmap for “Ghost” voxels (20 per participant, 160 in total), testing whether signed Matryoshka SAE features used to predict one voxel (source) generalize to predicting other voxels (target). Voxel order is determined by a hierarchical agglomerative clustering algorithm to cluster voxels with similar generalization profiles. Mutually well-predicted voxels are denoted by the red square. …
Figure 4: (A) Normalized encoding model predictivity of five language fROIs. (B) Generalization heatmap for language fROI voxels (2,296 voxels in total across eight participants; see Appendix G), testing whether signed Matryoshka SAE features used to predict one voxel (source) generalize to predicting other voxels (target). Voxel order is sorted according to participant and fROI. (C) Quantification of the prevalence…
Figure 5: Augmented Sparse Encoding Models preferentially rely on general features to predict …
Figure 6: Replication of Fig. 2A using representations from layer 14 of …
Figure 7: Average support set sizes (number of features selected by LASSO regression) for …
Figure 8: Average support set sizes (number of features selected by LASSO regression) for fROI …
Figure 9: Regression predictivity for Ghost voxels using all feature sets. …
Figure 10: Replication of Fig. 4 on an independent dataset. Following the procedure in Tuckute …
Figure 11: Signed Matryoshka feature prevalence across all participants and all language fROIs. We …
Figure 12: Predictivity of regressions restricted to using features from the most general Matryoshka …
Figure 13: Histogram of how often each of the Matryoshka features is selected by encoding models …
Figure 14: Average support set sizes (number of features selected by LASSO regression) per …
Figure 15: Performance of encoding models with feature sets restricted to individual Matryoshka …
Figure 16: Predictivity of regressions restricted to using features from the most general Matryoshka …
Figure 17: Number of non-zero Matryoshka features firing on at least …
Original abstract

A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously-uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well-explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Augmented Sparse Encoding Models, which replace dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features while including surprisal as a predictor. On 7T fMRI data from eight participants listening to 200 sentences, the approach recovers prior voxel interpretations (processing difficulty, meaning abstractness), identifies a new reliable people-related population, shows that a common feature set predicts the fronto-temporal language network (with frontal regions relatively well-explained by surprisal alone), and reports that brain responses align best with SAE features capturing the most general LM information rather than arbitrary features.

Significance. If the central empirical claims hold after addressing the noted issues, the work offers a concrete method for moving beyond black-box LM-brain comparisons by linking neural responses to interpretable sparse features, while also quantifying the contribution of surprisal. The recovery of known populations plus the generality result would strengthen evidence for nontrivial correspondence between brain and LM representations.

major comments (3)
  1. [Abstract / Results] Abstract and final Results section: the claim that brain responses are best explained specifically by the 'most general' SAE features (rather than arbitrary ones) rests on ranking features within a single LM and training regime; the manuscript does not report training comparable SAEs on additional LMs, measuring feature overlap, or testing whether the general subset retains superior predictive power when the underlying LM changes, leaving the distinction between primary and idiosyncratic variation tied to one representation space.
  2. [Methods] Methods section: data exclusion criteria for participants or trials, voxel selection thresholds, and statistical controls for multiple comparisons when identifying the new people-related population are not described; without these, it is impossible to evaluate whether post-hoc selection or family-wise error issues affect the reliability of the recovered and novel interpretations.
  3. [Results] Results on encoding performance: the superiority of general over arbitrary features is presented as supporting a nontrivial correspondence, but the section does not report effect sizes, confidence intervals, or formal statistical tests comparing the two conditions, which are needed to establish that the difference is load-bearing rather than marginal.
minor comments (2)
  1. [Figure 2 / Methods] Figure legends and Methods: the number of SAE layers, dictionary size, and sparsity level are referenced but the exact hyperparameter values and sensitivity analyses are not tabulated, making it difficult to assess reproducibility of the hierarchical organization.
  2. [Introduction / Methods] Notation: the distinction between 'Augmented' encoding models and standard SAE encoding is introduced without an explicit equation contrasting the two predictor sets.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate missing methodological details and improved statistical reporting. The first comment concerns scope rather than an error in the reported analysis.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and final Results section: the claim that brain responses are best explained specifically by the 'most general' SAE features (rather than arbitrary ones) rests on ranking features within a single LM and training regime; the manuscript does not report training comparable SAEs on additional LMs, measuring feature overlap, or testing whether the general subset retains superior predictive power when the underlying LM changes, leaving the distinction between primary and idiosyncratic variation tied to one representation space.

    Authors: The analysis isolates the effect of feature generality versus arbitrariness inside the representation space of one standard LM. This directly tests whether alignment reflects primary rather than idiosyncratic variation within that space. Extending the comparison across multiple LMs would require substantial additional computation and is outside the present scope; we will add an explicit limitations paragraph noting this boundary. revision: partial

  2. Referee: [Methods] Methods section: data exclusion criteria for participants or trials, voxel selection thresholds, and statistical controls for multiple comparisons when identifying the new people-related population are not described; without these, it is impossible to evaluate whether post-hoc selection or family-wise error issues affect the reliability of the recovered and novel interpretations.

    Authors: We apologize for the omission from the main text. Participant and trial exclusion followed standard motion and outlier criteria (none excluded beyond these); voxels were selected at the top decile of cross-validated variance explained; multiple comparisons for the people-tuned population were controlled with FDR at q < 0.05. These details appear in the supplement; we will move them into the main Methods section. revision: yes
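
The rebuttal names FDR control at q < 0.05 but not the specific procedure. A minimal sketch of the Benjamini-Hochberg step-up procedure, which is a common choice and is assumed here rather than confirmed by the source, looks like this; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True where the test is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k (the step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, one per candidate voxel or feature test.
example = benjamini_hochberg([0.01, 0.04, 0.03, 0.20], q=0.05)
```

Because the rule is step-up, a p-value that misses its own rank threshold can still be rejected when a larger p-value further down the ordering meets its threshold.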

  3. Referee: [Results] Results on encoding performance: the superiority of general over arbitrary features is presented as supporting a nontrivial correspondence, but the section does not report effect sizes, confidence intervals, or formal statistical tests comparing the two conditions, which are needed to establish that the difference is load-bearing rather than marginal.

    Authors: We agree that effect-size and inferential statistics are required. The revised Results section will report Cohen’s d, participant-level 95% confidence intervals, and paired t-test (or Wilcoxon signed-rank) results comparing general-feature versus arbitrary-feature encoding performance across the eight participants. revision: yes
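
The paired comparison promised here can be sketched with the standard library alone. This is an illustration under assumptions: the rebuttal does not say which variant of Cohen's d will be used, so the paired-differences form (often written d_z) is assumed, `paired_summary` is a hypothetical helper, and the per-participant scores are placeholders rather than data from the paper.

```python
import math
import statistics


def paired_summary(general, arbitrary):
    """Paired effect size (Cohen's d_z) and t statistic for per-participant scores."""
    if len(general) != len(arbitrary) or len(general) < 2:
        raise ValueError("need two equal-length samples with at least two participants")
    diffs = [g - a for g, a in zip(general, arbitrary)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; d_z and t are undefined")
    n = len(diffs)
    return {
        "mean_diff": mean_diff,
        "cohens_dz": mean_diff / sd_diff,
        "t": mean_diff / (sd_diff / math.sqrt(n)),
        "df": n - 1,
    }


# Placeholder scores for eight participants; not values reported in the paper.
general_scores = [0.52, 0.61, 0.55, 0.70, 0.65, 0.60, 0.58, 0.62]
arbitrary_scores = [0.45, 0.50, 0.51, 0.60, 0.57, 0.49, 0.55, 0.53]
summary = paired_summary(general_scores, arbitrary_scores)
```

With eight participants the t statistic has seven degrees of freedom; turning it into a p-value or confidence interval needs a t distribution, which the standard library does not provide, so that step is left out of the sketch.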

Circularity Check

0 steps flagged

Empirical encoding comparisons are independent of fitted brain parameters

Full rationale

The paper's core results consist of empirical encoding performance comparisons between different SAE feature sets and surprisal on held-out fMRI data. These comparisons do not reduce by the paper's equations to quantities defined solely from parameters fitted to the same brain dataset. Feature generality is assessed from LM-internal properties prior to brain modeling, and validation against prior interpretations uses external benchmarks rather than self-referential definitions. No load-bearing step equates a claimed prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on standard linear encoding assumptions and the premise that SAE features trained on one LM generalize to brain data. No new physical entities are postulated.

free parameters (2)
  • SAE sparsity level and dictionary size
    Chosen during SAE training; affects which features are extracted and therefore which interpretations are possible.
  • Number of SAE layers and feature hierarchy
    Architectural choices that determine the set of candidate predictors fed to the encoding model.
axioms (2)
  • domain assumption: Linear mapping from LM features plus surprisal to fMRI BOLD is sufficient to recover interpretable voxel tunings.
    Invoked throughout the encoding model validation and interpretation steps.
  • domain assumption: The 'most general' LM features can be identified by some selection criterion that is independent of the brain data.
    Required for the final claim that brain responses align with general rather than idiosyncratic variation.

pith-pipeline@v0.9.1-grok · 5801 in / 1418 out tokens · 17251 ms · 2026-06-27T22:15:36.865387+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 11 canonical work pages · 3 internal anchors

  1. AlKhamissi, B., Tuckute, G., Tang, Y., Binhuraib, T. O. A., Bosselut, A., and Schrimpf, M. From language to cognition: How LLMs outgrow the human language network. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 24332–24350,

  2. Antonello, R., Turek, J. S., Vo, V., and Huth, A. Low-dimensional structure in the space of language representations is reflected in brain responses. Advances in Neural Information Processing Systems, 34:8332–8344,

  3. Antonello, R., Singh, C., Jain, S., Hsu, A., Guo, S., Gao, J., Yu, B., and Huth, A. Generative causal testing to bridge data-driven models and scientific theories in language neuroscience. arXiv preprint arXiv:2410.00812,

  4. Benara, V., Singh, C., Morris, J. X., Antonello, R. J., Stoica, I., Huth, A. G., and Gao, J. Crafting interpretable embeddings for language neuroscience by asking LLMs questions. Advances in Neural Information Processing Systems, 37:124137,

  5. Bussmann, B., Nabeshima, N., Karvonen, A., and Nanda, N. Learning multi-level features with Matryoshka sparse autoencoders. In Forty-second International Conference on Machine Learning,

  6. de Varda, A. G., Malik-Moraleda, S., Tuckute, G., and Fedorenko, E. Multilingual computational models reveal shared brain responses to 21 languages. bioRxiv, pp. 2025–02,

  7. Fedorenko, E., Blank, I. A., Siegelman, M., and Mineroff, Z. Lack of selectivity for syntax relative to word meanings throughout the language network. Cognition, 203:104348,

  8. Guo, D., Wu, J., and Yiu, S. M. Sparse autoencoders map brain-LLM alignment onto cortical semantic topography. arXiv preprint arXiv:2605.23035,

  9. Hu, J., Small, H., Kean, H., Takahashi, A., Zekelman, L., Kleinman, D., Ryan, E., Nieto-Castañón, A., Ferreira, V., and Fedorenko, E. Precision fMRI reveals that the language-selective network supports both phrase-structure building and lexical access during language production. Ce...

  10. Kleinman, T. W. and Goldstein, A. Back to the feature: Toward a feature-centric account of brain–LM alignment. In ICLR 2026 Workshop on Representational Alignment (Re^4-Align). Kumar, S., Sumers, T. R., Yamakoshi, T., Goldstein, A., Hasson, U., Norman, K. A., Griffiths, T. L., Hawkins, R. D., and Nastase, S. A. Shared functional special...

  11. Lamarre, M., Chen, C., and Deniz, F. Attention weights accurately predict language representations in the brain. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 4513–4529,

  12. Luo, Y., Zhan, Y., Jiang, J., Liu, T., Wu, M., Zhou, Z., and Dong, B. From atoms to trees: Building a structured feature forest with hierarchical sparse autoencoders. arXiv preprint arXiv:2602.11881,

  13. Merlin, G. and Toneva, M. Language models and brains align due to more than next-word prediction and word-level information. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18431–18454. Association for Computational Linguistics,

  14. Miao, Z., Jung, H., Kragel, P. A., Bo, K., Sadil, P., Lindquist, M. A., and Wager, T. D. Common and distinct neural correlates of social interaction processing and theory of mind in narratives. Nature Communications,

  15. Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435,

  16. Saxe, R., Brett, M., and Kanwisher, N. Divide and conquer: a defense of functional localizers. Neuroimage, 30(4):1088–1096,

  17. Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,

  18. Tuckute, G., Kanwisher, N., and Fedorenko, E. Language in brains, minds, and machines. Annual Review of Neuroscience, 47, 2024a. URL https://www.annualreviews.org/content/journals/10.1146/annurev-neuro-120623-101142?TRACK=RSS. Tuckute, G., Sathe, A., Srikant, S., Taliaferro, M., Wang, M., Schrimpf, M., Kay, K., and Fedorenko, E. Driving and suppressing ...

  19. Wehbe, L., Blank, I. A., Shain, C., Futrell, R., Levy, R., von der Malsburg, T., Smith, N., Gibson, E., and Fedorenko, E. Incremental language comprehension difficulty predicts activity in the language network but not the multiple demand network. Cerebral Cortex, 31(9):4006–4023,

  20. Zeng, A. and Gallant, J. Disentangling superpositions: Interpretable brain encoding model with sparse concept atoms. bioRxiv, pp. 2025–11,

  21. Appendix A, Detailed Voxel Selection Procedure. We selected voxels of interest in two main ways: one based on the two principal components (PCs) of sentence-evoked responses identified by Tuckute et al. (2025), and the other one based on whether a voxel is part of the fronto-temporal language network (Fedorenko et al., 2010). PC-derived subtypes (Hard-to-Process,...

  22. Appendix D, Feature Set Support Size. In Fig. 7, we present the average support size (i.e., number of features) resulting from LASSO-based feature selection using residual stream, JumpReLU, and Matryoshka feature sets for Abstract and Concrete voxel subtypes. When including processing difficulty voxel subtypes, the average Matryoshka support size is 18 features. In Fi...

  23. These features broadly reflect the interpretations of the PCs that have been identified in Tuckute et al. (2025). Features -79, -94, +389, and +40 are all concordant with Abstract voxels, which are predominantly located in the language network. Features 44 and 71 are both driven more by sentences comprised of fewer tokens, and are thus suppressed by more co...

  24. [Plot residue, "Feature Granularity by Dataset": x-axis Feature Bin (128, 512, 2048, 8192, 32768); y-axis Average Number of Features (0 to 12); series Hard to Process, Easy to Process, Abstract, Concrete, Ghost.] Figure 14: Average support set sizes (number of features selected by LASSO regression) per Matryoshka bin when predicting voxels in each subtype. [Plot residue, "All Datasets": y-axis label truncated ("Nor...").]

  25. … any single non-zero feature in the bin fires on.

      Bin  Range      N feats  N non-zero     Min fires  Max fires
      1    0–128      128      122 (95.3%)    1          200
      2    128–512    384      342 (89.1%)    1          200
      3    512–2048   1,536    1,186 (77.2%)  1          199
      4    2048–8192  6,144    3,397 (55.3%)  1          197
      5    8192+      24,576   5,921 (24.1%)  1          200

      [Histogram residue: x-axis "Fires on ≥ N sentences" (1 to 200); y-axis 0 to 6000, label truncated ("Nu...").]
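
The bin ranges in the table above suggest a simple lookup from a feature index to its Matryoshka bin. The sketch below assumes 0-based feature indices and half-open ranges (so index 128 falls in bin 2); neither convention is stated in the extracted text, and `matryoshka_bin` is a hypothetical helper name.

```python
from bisect import bisect_right

# Upper edges of bins 1-4, read off the table; bin 5 is everything from 8192 up.
BIN_EDGES = [128, 512, 2048, 8192]
TOTAL_FEATURES = 32768  # 128 + 384 + 1536 + 6144 + 24576


def matryoshka_bin(feature_index):
    """Return the 1-based bin of a 0-based feature index, assuming half-open ranges."""
    if not 0 <= feature_index < TOTAL_FEATURES:
        raise ValueError("feature index out of range")
    # bisect_right counts how many bin edges are <= the index.
    return bisect_right(BIN_EDGES, feature_index) + 1


# Unsigned indices of features mentioned in the appendix fragment above.
bins = {index: matryoshka_bin(index) for index in (40, 79, 94, 389)}
```

Under these assumptions, features 40, 79, and 94 fall in the most general bin and feature 389 in the second, which fits the claim that encoding models lean on the earliest, most general part of the dictionary.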