
arxiv: 2606.19081 · v1 · pith:2P77Y643new · submitted 2026-06-17 · 🧬 q-bio.NC · cs.HC

Retrieval-Based Brain Decoding by Alignment, not Complexity

Pith reviewed 2026-06-26 18:34 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.HC
keywords brain decoding · fMRI · contrastive learning · foundation models · linear models · embeddings · multimodal · retrieval

The pith

Linear contrastive decoders map fMRI activity to foundation model embeddings more accurately than ridge regression or non-linear alternatives across images, text, and sound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to map fMRI brain activity onto the high-dimensional embedding spaces used by foundation models for vision, language, and audio. It argues that although brain computations are non-linear at fine scales, fMRI averaging and noise produce an effectively linear observable signal. Experiments show that training a linear decoder with a contrastive objective yields better retrieval performance than either ridge regression or standard non-linear networks. This pattern holds consistently across multiple datasets and sensory modalities. A reader would care because the finding implies that decoding success depends more on the alignment objective than on added model complexity.

Core claim

Although neural computations are highly non-linear at the microscale, fMRI measurements average signals across space and time, further smoothed by noise, effectively linearizing the observable representation. As a result, linear contrastive decoders consistently outperform ridge regression and standard non-linear alternatives when mapping fMRI activity to the embedding spaces of foundation models, and these results generalize across images, text, and sound.
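For reference, the ridge-regression baseline named in this claim has a closed-form solution. The following is a minimal sketch, not the authors' code: the function name `ridge_fit`, the penalty value, and the array shapes are illustrative assumptions, and the paper's actual preprocessing and regularization settings are not given on this page.

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge solution W = (X^T X + lam * I)^{-1} X^T Y.

    X: (n_samples, n_voxels) fMRI features (hypothetical shape).
    Y: (n_samples, emb_dim) target embeddings.
    """
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(gram, X.T @ Y)

# Synthetic stand-in data; real inputs would be voxel responses and
# foundation-model embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
W_true = rng.standard_normal((5, 3))
Y = X @ W_true

W_hat = ridge_fit(X, Y, lam=1e-8)  # near-zero penalty recovers W_true here
Y_pred = X @ W_hat                 # predicted embeddings
```

Ridge minimizes squared error to each target embedding independently, which is the property the contrastive objective replaces.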

What carries the argument

Linear contrastive decoder that aligns fMRI activity vectors with foundation-model embeddings via a contrastive training objective.
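One way to read this mechanism is as a single weight matrix scored with a CLIP-style symmetric InfoNCE loss. The exact loss, temperature, and normalization used in the paper are not stated on this page, so everything below (the function names, the temperature of 0.07, the symmetric form) is an assumption made for illustration.

```python
import numpy as np

def l2_normalize(a):
    # Scale each row to unit length so dot products are cosine similarities.
    return a / np.linalg.norm(a, axis=1, keepdims=True)

def log_softmax(a, axis):
    # Numerically stable log-softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    return a - np.log(np.exp(a).sum(axis=axis, keepdims=True))

def contrastive_loss(X, E, W, temperature=0.07):
    """Symmetric InfoNCE between linear predictions X @ W and embeddings E.

    Row i of X and row i of E form the positive pair; every other row in
    the batch serves as a negative. The temperature is an assumed value.
    """
    Z = l2_normalize(X @ W)          # linear decoder output
    E = l2_normalize(E)
    logits = Z @ E.T / temperature   # (batch, batch) similarity matrix
    brain_to_stim = -np.mean(np.diag(log_softmax(logits, axis=1)))
    stim_to_brain = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (brain_to_stim + stim_to_brain)

# Toy batch: four "voxel" patterns and four orthogonal target embeddings.
X = np.eye(4)
E = np.eye(4)
loss_aligned = contrastive_loss(X, E, W=np.eye(4))          # perfect alignment
loss_collapsed = contrastive_loss(X, E, W=np.ones((4, 4)))  # all predictions identical
```

Training would minimize this loss over `W` with any gradient-based optimizer. The decoder stays linear because `X @ W` is the only learned map.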

If this is right

  • Decoding gains arise more from the choice of training objective than from architectural complexity.
  • Contrastive-linear models constitute a principled strategy for brain decoding.
  • The same linear-contrastive approach succeeds for vision, language, and audio stimuli.
  • Performance improvements are expected to generalize across additional fMRI datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Retrieval-based decoding may extend naturally to other coarse-grained neuroimaging signals such as EEG or MEG.
  • Semantic embedding spaces appear to capture a large fraction of the variance observable in averaged fMRI responses.
  • Linear alignment could serve as a baseline for testing whether finer-scale recordings require non-linear mappings.

Load-bearing premise

fMRI measurements average signals across space and time and are smoothed by noise, which linearizes the observable brain representation.
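A toy numerical illustration of the averaging argument, and nothing more: it is not a model of the BOLD signal and is not taken from the paper. It assumes a hypothetical population of saturating `tanh` units whose thresholds are spread evenly over the input range, and compares how well a straight line fits one unit versus the population mean.

```python
import numpy as np

def r_squared_linear(x, y):
    """Fraction of variance in y explained by the best-fitting straight line."""
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    return 1.0 - residual.var() / y.var()

x = np.linspace(-1.0, 1.0, 201)            # stimulus variable

# A single strongly non-linear (saturating) unit.
single_unit = np.tanh(5.0 * x)

# Hypothetical population: same non-linearity, thresholds spread evenly.
thresholds = np.linspace(-2.0, 2.0, 401)
population_mean = np.tanh(5.0 * (x[:, None] - thresholds[None, :])).mean(axis=1)

r2_single = r_squared_linear(x, single_unit)
r2_population = r_squared_linear(x, population_mean)
print(f"single unit R^2: {r2_single:.3f}, population mean R^2: {r2_population:.3f}")
```

In this construction the mean is close to linear because the individual saturation points are staggered. Whether real voxel responses behave this way is exactly the premise under question.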

What would settle it

Non-linear models trained with the identical contrastive objective would need to produce reliably higher retrieval accuracy than the linear version on held-out fMRI data from at least two modalities.
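The comparison described here depends on a retrieval metric. A minimal sketch of top-k retrieval accuracy under cosine similarity follows; the function name `top_k_accuracy` and the toy inputs are illustrative assumptions, and the paper may use a different candidate pool or ranking protocol.

```python
import numpy as np

def top_k_accuracy(pred, target, k=1):
    """Share of samples whose true embedding ranks in the top k by cosine similarity.

    pred, target: (n_samples, emb_dim); row i of pred should retrieve row i of target.
    """
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    sim = pred @ target.T                       # (n, n) cosine similarities
    true_sim = np.diag(sim)[:, None]            # similarity to the correct item
    n_better = (sim > true_sim).sum(axis=1)     # candidates ranked above it
    return float(np.mean(n_better < k))

# Toy check with orthogonal targets.
targets = np.eye(4)
perfect = top_k_accuracy(targets, targets, k=1)

# Predictions that each point at the wrong target.
shifted = np.roll(targets, shift=1, axis=0)
top1_shifted = top_k_accuracy(shifted, targets, k=1)
top2_shifted = top_k_accuracy(shifted, targets, k=2)
```

Applying the same function to a linear and a non-linear decoder trained with the identical loss, on held-out data, would give the comparison this section asks for.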

Figures

Figures reproduced from arXiv: 2606.19081 by Matteo Ciferri, Matteo Ferrante, Nicola Toschi.

Figure 1: The same linear contrastive model is employed across three experimental conditions, differing only in the stimulus modality (audio, textual, or visual). For each modality, neural responses from fMRI are aligned through subject-specific linear transformations and mapped into the corresponding stimulus embedding space (obtained from a pretrained foundation model such as CLIP for images, CLAP for audio, or…
Figure 2: (a) Each heatmap represents the cosine similarity between predicted and ground-truth stimulus embeddings, computed sample-by-sample. Results are shown for three datasets (NSD, HUTH, GTZAN) and two models: a linear Ridge regression (left column) and the best contrastive learning model (the linear one, right column). The diagonal reflects correct predictions with high similarity between corresponding stimul…
Figure 3: Quantitative bar charts to visualize decoding results. Stars above the bars reveal significance, according to the table in the Appendix. Double stars indicate p-value lower than 1e-10.
Figure 4: Random samples of brain decoding results. For each panel, the target column shows the ground-truth stimulus (music track, image, or sentence, depending on the modality), while the neighbor columns display the top retrieved candidates from the model’s latent space based on cosine similarity. (a) Retrieval of images viewed by participants. (b) Retrieval of text/sentences corresponding to the neural response…
Original abstract

A prominent theory in cognitive science suggests that concepts in the brain are organized as high-dimensional vectors, with semantic meaning captured by directions and relative angles in this space. Brain decoding is the effort of reconstructing or retrieving stimuli (or their representations) from neural activity and involves finding a function that approximates how the brain represents concepts. This motivates the investigation of contrastive objectives as biologically plausible candidates to reverse the brain loss function. In this work, we study how functional MRI (fMRI) activity can generally be mapped with the embedding spaces of foundation models in vision, language, and audio. Although neural computations are highly non-linear at the microscale, fMRI measurements average signals across space and time, further smoothed by noise, effectively linearizing the observable representation. Consistent with these views, our experiments across multiple datasets demonstrate that linear contrastive decoders consistently outperform ridge regression and standard non-linear alternatives, and that these results generalize across images, text, and sound. These findings indicate that decoding gains arise more from the choice of training objective than from architectural complexity, pointing to contrastive-linear models as a principled strategy for brain decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that fMRI-based brain decoding to foundation model embeddings (vision, language, audio) is best performed by linear contrastive decoders, which outperform both ridge regression and standard non-linear alternatives across multiple datasets. It attributes the gains primarily to the choice of contrastive training objective rather than architectural complexity, and argues that fMRI's spatial/temporal averaging and noise effectively linearize the observable neural representations, making contrastive-linear models a principled decoding strategy.

Significance. If the central comparison is shown to isolate objective from architecture, the result would be significant: it supplies cross-modal evidence that simple linear models suffice for retrieval-based decoding when trained contrastively, offers a computationally efficient alternative to complex non-linear decoders, and aligns with the view that contrastive objectives are biologically plausible for approximating brain representations. The generalization across images, text, and sound is a clear strength of the reported experiments.

major comments (1)
  1. [Abstract] Abstract: the central interpretation that 'decoding gains arise more from the choice of training objective than from architectural complexity' requires that the 'standard non-linear alternatives' were trained under the identical contrastive objective used for the linear decoders. The provided description gives no indication that non-linear contrastive variants were evaluated; without this control the results remain consistent with either objective or model-class effects and therefore do not support the claimed separation of factors.
minor comments (2)
  1. Methods and results sections should report dataset sizes, number of subjects, exact statistical tests, and p-values supporting the outperformance claims.
  2. The precise architectures, loss functions, and hyper-parameter regimes for the non-linear baselines need explicit description to allow replication of the comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive comment. The observation correctly identifies a limitation in how the abstract frames the separation between training objective and model complexity. We address the point directly below and will revise the manuscript to ensure the claims accurately reflect the experiments performed.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central interpretation that 'decoding gains arise more from the choice of training objective than from architectural complexity' requires that the 'standard non-linear alternatives' were trained under the identical contrastive objective used for the linear decoders. The provided description gives no indication that non-linear contrastive variants were evaluated; without this control the results remain consistent with either objective or model-class effects and therefore do not support the claimed separation of factors.

    Authors: We agree with the referee's assessment. The non-linear alternatives evaluated in the manuscript were trained with standard regression objectives (primarily mean-squared error), not the contrastive loss used for the linear decoders. Consequently, the reported results show that linear contrastive models outperform both ridge regression and typical non-linear regression models, but they do not isolate the contribution of the objective from architectural capacity. We will revise the abstract and the corresponding discussion sections to remove the stronger claim of separation and instead state that contrastive training yields strong retrieval performance even when restricted to linear mappings, outperforming more complex non-linear models trained under conventional regression losses. This revision will be made in the next version of the manuscript. revision: yes
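The confound discussed in this exchange can be made explicit as a 2×2 design of architecture against objective. The sketch below only enumerates conditions; the labels are shorthand chosen for illustration, and which cells count as evaluated is taken from the simulated rebuttal above, not from the paper's tables.

```python
from itertools import product

architectures = ["linear", "non-linear"]
objectives = ["regression", "contrastive"]

# Cells described as evaluated in the simulated rebuttal (an assumption
# about the paper, not a verified list of its experiments).
evaluated = {
    ("linear", "regression"),      # ridge regression
    ("linear", "contrastive"),     # linear contrastive decoder
    ("non-linear", "regression"),  # non-linear models trained with MSE
}

# Any cell absent from the set is a missing control.
missing = [cell for cell in product(architectures, objectives)
           if cell not in evaluated]
print(missing)
```

With the non-linear contrastive cell empty, a difference between linear contrastive and non-linear regression models changes two factors at once, which is the referee's point.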

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivation chain

full rationale

The paper reports experimental results comparing linear contrastive decoders against ridge regression and non-linear baselines on fMRI datasets for image, text, and audio modalities. No equations, first-principles derivations, or load-bearing self-citations are present in the provided text that reduce any claimed result to its own inputs by construction. The central claim rests on observed performance differences, which are falsifiable via replication on held-out data and do not involve fitted parameters renamed as predictions or ansatzes smuggled through citations. This is a standard empirical study; the derivation chain is absent, so circularity score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that fMRI effectively linearizes neural representations; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: fMRI measurements average signals across space and time, further smoothed by noise, effectively linearizing the observable representation
    Invoked to justify why linear models are appropriate despite non-linear microscale computations.



Ethics Statement

This study uses only publicly available datasets. No new human-subject data...