
arxiv: 2605.17198 · v1 · pith:7YJVU6BGnew · submitted 2026-05-16 · 🧬 q-bio.NC · cs.CV

MIRAGE: Robust multi-modal architectures translate fMRI-to-image models from vision to mental imagery

Pith reviewed 2026-05-20 13:42 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.CV
keywords mental image reconstruction · fMRI decoding · multi-modal features · diffusion model · NSD-Imagery · vision-to-imagery transfer · linear backbone

The pith

A linear backbone with multi-modal text and image features lets a diffusion model reconstruct mental images from fMRI after training only on external visual stimuli.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that high performance on reconstructing seen images from brain activity does not automatically transfer to internally generated mental images, and that some existing decoders fail on the latter task. It introduces MIRAGE to close this gap: a linear backbone predicts multi-modal text and image features from fMRI, and those features condition a diffusion model. On the NSD-Imagery benchmark, MIRAGE outperforms prior methods on both quantitative feature metrics and human ratings. The central finding is that the right architecture makes large external vision datasets usable for mental-image decoding, removing the need for direct mental-imagery training data. This matters because successful cross-decoding would let researchers study visual thought without requiring participants to view external stimuli during every scan.

Core claim

MIRAGE trains on large-scale datasets of external visual stimuli to decode mental images from brain activity. It uses a linear backbone that combines multi-modal text features with both high- and low-level image features as conditioning input to a diffusion model. On the NSD-Imagery benchmark this yields state-of-the-art reconstructions according to feature-space metrics and human raters. Ablation experiments indicate that performance peaks when image features are kept low-dimensional and when text guidance is included alongside both high- and low-level visual features.

What carries the argument

The MIRAGE linear backbone that fuses multi-modal text and image features to condition a diffusion model for fMRI-to-image translation.
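
As a rough illustration of what a linear backbone means here, the sketch below fits a closed-form ridge regression from simulated voxel responses to a feature embedding on "vision" trials, then applies the frozen weights to noisier "imagery" trials. Ridge regression is named in the Figure 7 caption; everything else (the synthetic data, the array sizes, `alpha`, the shared mapping `A`, and the helper names `fit_ridge`, `simulate`, and `mean_column_corr`) is an assumption made for illustration and is not taken from the paper.

```python
import numpy as np

def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def mean_column_corr(pred, target):
    """Average Pearson correlation across embedding dimensions."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.linalg.norm(p, axis=0) * np.linalg.norm(t, axis=0)
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
n_voxels, n_dims = 200, 16  # hypothetical sizes, far smaller than real data

# Assumption: vision and imagery share one feature-to-voxel mapping A.
A = rng.normal(size=(n_dims, n_voxels))

def simulate(n_trials, noise):
    """Simulate (fMRI, feature) pairs; imagery is modeled as a noisier copy."""
    Y = rng.normal(size=(n_trials, n_dims))
    X = Y @ A + noise * rng.normal(size=(n_trials, n_voxels))
    return X, Y

X_vis, Y_vis = simulate(1000, noise=1.0)  # stand-in for seen-image trials
X_img, Y_img = simulate(100, noise=3.0)   # stand-in for mental-imagery trials

W = fit_ridge(X_vis, Y_vis, alpha=10.0)   # trained on vision only
Y_hat = X_img @ W                         # frozen weights applied to imagery
score = mean_column_corr(Y_hat, Y_img)
print(f"imagery decoding correlation: {score:.3f}")
```

Because the simulated imagery trials share the mapping `A` with the vision trials, the transfer works by construction. The paper's empirical question is whether real imagery responses behave this way.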

If this is right

  • Mental-image reconstruction reaches state-of-the-art levels without any direct training on internally generated imagery data.
  • Low-dimensional image features plus text and both high- and low-level visual cues produce the most accurate mental-image outputs.
  • Existing large-scale vision datasets become viable training resources for mental-image decoders once the architecture is chosen appropriately.
  • The gap between seen-image and mental-image decoding performance can be closed by explicit multi-modal conditioning rather than by scaling model size alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the transfer works, brain-computer interfaces could visualize a person's current visual thought without requiring them to look at matching external pictures.
  • The same multi-modal conditioning strategy might extend to decoding other internal states such as auditory imagery or spatial navigation.
  • Future tests could check whether the low-dimensional feature preference holds when the diffusion model is replaced by a different generative backbone.
  • The result suggests that mental imagery and external vision may share a common low-dimensional representational subspace that can be read out with modest additional guidance.

Load-bearing premise

Brain activity patterns evoked by external visual stimuli are similar enough to those generated during mental imagery that a decoder trained on the former can be applied to the latter.

What would settle it

A controlled experiment in which participants generate mental images while scanned, a model is trained directly on those mental-image fMRI pairs, and that model produces reconstructions rated higher by humans or closer in feature space than MIRAGE outputs on the same test set.
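
One way such a head-to-head comparison could be scored is a paired permutation test on per-item ratings. The sketch below is a generic sign-flip test, not a procedure described in the paper; the function name `paired_permutation_pvalue`, the number of permutations, and the example ratings are all hypothetical.

```python
import numpy as np

def paired_permutation_pvalue(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-item differences."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    # Under the null, each paired difference is equally likely to flip sign.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = np.abs((signs * diffs).mean(axis=1))
    # Add-one correction keeps the estimated p-value strictly positive.
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
ratings_cross = rng.uniform(0.3, 0.7, size=40)  # hypothetical per-image ratings
ratings_direct = ratings_cross + 0.2            # hypothetical uniformly better decoder

p_value = paired_permutation_pvalue(ratings_direct, ratings_cross)
p_same = paired_permutation_pvalue(ratings_cross, ratings_cross)
print(p_value, p_same)
```

Pairing by test item matters because both decoders are rated on the same imagined stimuli, so per-item difficulty cancels out in the differences.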

Figures

Figures reproduced from arXiv: 2605.17198 by Cesar Kadir Torrico Villanueva, Jonathan Xu, Jordyn Ojeda, Paul S. Scotti, Reese Kneeland, Shuhb Khanna, Thomas Naselaris.

Figure 1
Figure 1. MIRAGE (ours) vs MindEye2 [1] reconstructions of an imagined image from fMRI brain activity.
Figure 2
Figure 2. Qualitative comparison of reconstruction
Figure 3
Figure 3. (B) Human similarity scores for simple and complex stimuli: X-axis = vision, Y-axis = imagery; each point is the mean over 12 samples (larger bold points are the overall means), colored/shaped by method. PCA-fit slopes closer to unity indicate tighter imagery–vision correspondence; dashed unity line shown.
Figure 4
Figure 4. (A) Head-to-head human similarity score results for the conceptual stimuli. The Y-axis represents the similarity score advantage (difference between target method’s score and the alternative, on the radial X-axis); a larger colored polygon area indicates a stronger advantage, and the dashed circle at unity denotes equal performance. MIRAGE outperforms all other methods (p < 0.001). (B) Ablation analyses: m…
Figure 5
Figure 5. Overview of the tasks utilized for the NSD-Imagery benchmark.
Figure 6
Figure 6. MIRAGE training pipeline. (1) Brain activity (7T fMRI) acquired as NSD subjects view > 10K stimuli. (2) Stimuli are passed to VDVAE encoder [37] yielding (1 × 91168) latents. (3) LLaVA v1.5-13B [62, 63] generates synthetic captions. (4) Captions are encoded into CLIP ViT-bigG/14 text embeddings (77 × 1280) [64]. (5) Stimuli are also passed through the CLIP ViT-L/14 image encoder [23] to generate both CLS to…
Figure 7
Figure 7. MIRAGE inference pipeline. (1) The NSD subjects imagine stimuli from letter cues under 7T fMRI. (2) A set of feature embeddings is predicted by passing the measured fMRI brain activity through our frozen ridge regression models. (3) The VDVAE [37] latents are reconstructed into a low-level image. (4) The image is filtered to boost its structure. (5) The filtered low-level image, decoded image embedding, an…
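
The "PCA-fit slope" mentioned in the Figure 3 caption can be read as the slope of the first principal axis of the (vision, imagery) point cloud, which treats both axes symmetrically, unlike ordinary least squares. The sketch below shows that reading; the exact fitting procedure is not given on this page, so the function `pca_slope` and the toy scores are assumptions.

```python
import numpy as np

def pca_slope(x, y):
    """Slope of the first principal axis of the centered (x, y) points."""
    pts = np.column_stack([x, y]).astype(float)
    pts -= pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues returned in ascending order
    vx, vy = eigvecs[:, -1]           # direction of largest variance
    return vy / vx                    # undefined if the axis is vertical (vx == 0)

# Hypothetical per-method scores: imagery tracks vision at half strength.
vision_scores = np.arange(10, dtype=float)
imagery_scores = 0.5 * vision_scores + 1.0

slope_half = pca_slope(vision_scores, imagery_scores)
slope_unity = pca_slope(vision_scores, vision_scores)
print(slope_half, slope_unity)
```

A slope near 1 would mean imagery scores rise one-for-one with vision scores, which is the correspondence the dashed unity line marks.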
read the original abstract

To be useful for downstream applications, vision decoding models that are trained to reconstruct seen images from human brain activity must be able to generalize to internally generated visual representations, i.e., mental images. In an analysis of the recently released NSD-Imagery dataset, we demonstrated that while some modern vision decoders can perform quite well on mental image reconstruction, some fail, and that state-of-the-art (SOTA) performance on seen image reconstruction is no guarantee of SOTA performance on mental image reconstruction. Motivated by these findings, we developed MIRAGE, a method explicitly designed to train on vision datasets and cross-decode mental images from brain activity. MIRAGE employs a linear backbone and multi-modal text and image features as input to a diffusion model. Feature metrics and human raters establish MIRAGE as SOTA for mental image reconstruction on the NSD-Imagery benchmark. With ablation analysis we show that mental image reconstruction works best when decoders use image features with relatively few dimensions and include guidance from text-based and both high- and low-level image-based features. Our work indicates that--given the right architecture--existing large-scale datasets using external stimuli are viable training data for decoding mental images, and warrant optimism about the future success and utility of mental image reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MIRAGE, a linear multi-modal architecture that trains on external vision datasets and uses text plus image features (high- and low-level) as input to a diffusion model for reconstructing mental images from fMRI. On the NSD-Imagery benchmark it reports state-of-the-art performance via feature metrics and human ratings, with ablations showing best results when image features are low-dimensional and guidance from text and both high- and low-level image features is included. The central conclusion is that large-scale seen-image datasets are viable training data for mental-image decoding.

Significance. If the reported generalization holds, the result would be significant for brain decoding: it provides an explicit architecture and training recipe that bridges seen-image and mental-image regimes, concrete ablation evidence on which feature types matter, and a relatively simple linear backbone that may aid interpretability. The work also yields a falsifiable prediction: performance on cued mental imagery should transfer when the same linear mapping is applied to novel, non-cued internal content.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): the claim that 'existing large-scale datasets using external stimuli are viable training data for decoding mental images' is load-bearing for the paper's main contribution, yet the NSD-Imagery mental-imagery trials are cued by previously viewed natural scenes. This leaves open the possibility that brain activity contains recall components aligned with the training distribution rather than arbitrary internally generated content; the reported ablations do not isolate this factor.
  2. [Results] Results section: the assertion of SOTA performance is supported only by the statement that 'feature metrics and human raters establish MIRAGE as SOTA'; no numerical values, baseline scores, error bars, or statistical tests appear in the abstract or summary text, preventing direct verification of the performance gap.
minor comments (2)
  1. [Methods] Methods: the exact procedure for obtaining and reducing the dimensionality of the image features (one of the free parameters listed in the axiom ledger) should be stated explicitly, including the source embedding model and any learned projection.
  2. [Figure 1 and §3] Figure captions and §3: the multi-modal input diagram should label each feature stream (text, high-level image, low-level image) and indicate whether the linear backbone is trained exclusively on vision data before zero-shot application to mental-imagery fMRI.
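
To make the first minor comment concrete, the sketch below shows one generic way image features could be reduced to a few dimensions: PCA computed with an SVD, with reconstruction error tracked as the number of components changes. This page does not state the paper's actual procedure, so PCA itself, the helper names (`fit_pca`, `project`, `reconstruct`), and the synthetic features are assumptions.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return the feature mean and the top principal directions (rows)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(features, mean, components):
    """Map full features to low-dimensional codes."""
    return (features - mean) @ components.T

def reconstruct(codes, mean, components):
    """Map low-dimensional codes back to the full feature space."""
    return codes @ components + mean

rng = np.random.default_rng(0)
# Hypothetical features: 8 underlying factors embedded in 128 dimensions.
latent = rng.normal(size=(300, 8))
mixing = rng.normal(size=(8, 128))
features = latent @ mixing + 0.05 * rng.normal(size=(300, 128))

errors = {}
for k in (2, 8, 32):
    mean, comps = fit_pca(features, k)
    codes = project(features, mean, comps)
    recon = reconstruct(codes, mean, comps)
    errors[k] = float(np.mean((features - recon) ** 2))
print(errors)
```

In this toy setup the error drops sharply up to the true number of factors and only marginally afterwards, which is the kind of curve a stated reduction procedure would let readers reproduce.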

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. These help us clarify the scope of our generalization claims and strengthen the presentation of quantitative results. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): the claim that 'existing large-scale datasets using external stimuli are viable training data for decoding mental images' is load-bearing for the paper's main contribution, yet the NSD-Imagery mental-imagery trials are cued by previously viewed natural scenes. This leaves open the possibility that brain activity contains recall components aligned with the training distribution rather than arbitrary internally generated content; the reported ablations do not isolate this factor.

    Authors: We agree that the NSD-Imagery mental-imagery trials are cued by previously viewed scenes and therefore may engage recall processes in addition to internally generated content. Our ablations examine feature-type contributions rather than isolating recall versus pure generation. Nevertheless, the central result remains that a linear multi-modal model trained exclusively on external vision data successfully decodes these mental images, supporting the viability of large-scale seen-image datasets for mental-image reconstruction on this benchmark. We will revise the abstract and add a paragraph in the Discussion to explicitly note the cued nature of the imagery, distinguish it from uncued internal content, and frame this as a limitation for future work. revision: partial

  2. Referee: [Results] Results section: the assertion of SOTA performance is supported only by the statement that 'feature metrics and human raters establish MIRAGE as SOTA'; no numerical values, baseline scores, error bars, or statistical tests appear in the abstract or summary text, preventing direct verification of the performance gap.

    Authors: We agree that the abstract and high-level summary would be strengthened by including concrete numerical comparisons. The full Results section already reports detailed feature-metric values, baseline scores, human ratings, and statistical comparisons in tables and figures. We will revise the abstract to include representative quantitative results (e.g., key metric improvements and human preference rates) with pointers to the supporting tables, enabling immediate verification of the SOTA claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is empirically grounded

full rationale

The paper trains MIRAGE (linear backbone + multi-modal features into diffusion model) on external-stimulus vision datasets and reports SOTA metrics on the held-out NSD-Imagery mental-imagery benchmark. The central claim—that such training data are viable for mental-image decoding—rests on cross-domain empirical performance rather than any self-definitional mapping, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or sections in the provided text reduce the generalization result to a quantity defined by the model’s own fitted values. The architecture choices and ablation results (low-dimensional image features plus text/high-low guidance) are presented as independent design decisions whose success is measured externally. This is the most common honest non-finding for a methods paper whose test set is distinct from its training distribution.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that pre-trained vision and language models produce features that can be linearly mapped from fMRI to guide diffusion-based generation of mental images, plus the domain assumption that the NSD-Imagery dataset faithfully represents mental imagery.

free parameters (1)
  • dimensionality of image features
    Ablation analysis identifies relatively few dimensions as optimal, implying this hyperparameter is selected to fit performance on the benchmark.
axioms (1)
  • domain assumption Large-scale external-stimulus fMRI datasets can serve as effective training data for mental imagery decoding.
    Explicitly stated as a conclusion in the abstract and required for the claim that existing datasets are viable.

pith-pipeline@v0.9.0 · 5788 in / 1291 out tokens · 62897 ms · 2026-05-20T13:42:20.293337+00:00 · methodology



Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 3 internal anchors

  1. [1]

    Paul S. Scotti, Mihir Tripathy, Cesare Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, and Tanishq Mathew Abraham. Mindeye2: shared-subject models enable fmri-to-image with 1 hour of data. In Proceedings of the 41st International Conference on Machine Learning, 2024

  2. [2]

    Reese Kneeland, Paul S. Scotti, Ghislain St-Yves, Jesse Breedlove, Kendrick Kay, and Thomas Naselaris. Nsd-imagery: A benchmark dataset for extending fmri vision decoding methods to mental imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28852–28862, June 2025

  3. [4]

    Stephen M Kosslyn, William L Thompson, and Giorgio Ganis. The case for mental imagery. Oxford University Press, 2006

  4. [5]

    Mark Stokes, Russell Thompson, Rhodri Cusack, and John Duncan. Top-Down Activation of Shape-Specific Population Codes in Visual Cortex during Mental Imagery. Journal of Neuroscience, 29(5):1565–1572, February 2009. ISSN 0270-6474, 1529-2401. doi: 10.1523/JNEUROSCI.4657-08.2009. URL https://www.jneurosci.org/content/29/5/1565. Publisher: Society for Neuros...

  5. [6]

    Rainer Goebel, Rick van Hoof, Salil Bhat, Michael Lührs, and Mario Senden. Reading Imagined Letter Shapes from the Mind’s Eye Using Real-time 7 Tesla fMRI. In 2022 10th International Winter Conference on Brain-Computer Interface (BCI), pages 1–3, February 2022. doi: 10.1109/BCI53720.2022.9735031. ISSN: 2572-7672

  6. [7]

    Thomas Naselaris, Cheryl A. Olman, Dustin E. Stansbury, Kamil Ugurbil, and Jack L. Gallant. A voxel-wise encoding model for early visual areas decodes mental images of remembered scenes. NeuroImage, 105:215–228, January 2015. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2014.10.018. URL https://www.sciencedirect.com/science/article/pii/S1053811914008428

  7. [8]

    Leila Reddy, Naotsugu Tsuchiya, and Thomas Serre. Reading the mind’s eye: Decoding category information during mental imagery. NeuroImage, 50(2):818–825, April 2010. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2009.11.084. URL https://www.sciencedirect.com/science/article/pii/S1053811909012701

  8. [9]

    Sue-Hyun Lee, Dwight J Kravitz, and Chris I Baker. Disentangling visual imagery and perception of real-world objects. Neuroimage, 59(4):4064–4073, 2012

  9. [10]

    Joel Pearson. The human imagination: the cognitive neuroscience of visual mental imagery. Nature reviews neuroscience, 20(10):624–634, 2019

  10. [11]

    Tiasha Saha Roy, Jesse Breedlove, Ghislain St-Yves, Kendrick Kay, and Thomas Naselaris. Comparison of signal to noise in vision and imagery for qualitatively different kinds of stimuli. Journal of Vision, 23(9):5961, 2023. ISSN 1534-7362. doi: 10.1167/jov.23.9.5961. URL https://doi.org/10.1167/jov.23.9.5961

  11. [12]

    Tiasha Saha Roy, Jesse Breedlove, Ghislain St-Yves, Kendrick Kay, and Thomas Naselaris. Mental imagery: Weak vision or compressed vision? In Conference on Cognitive Computational Neuroscience, 2023. doi: 10.32470/CCN.2023.1693-0. URL https://2023.ccneuro.org/view_paper4eea.html?PaperNum=1693

  12. [13]

    Jesse L. Breedlove, Ghislain St-Yves, Cheryl A. Olman, and Thomas Naselaris. Generative feedback explains distinct brain activity codes for seen and mental images. Current Biology, 30(12):2211–2224.e6, 2020. ISSN 0960-9822. doi: https://doi.org/10.1016/j.cub.2020.04.014. URL https://www.sciencedirect.com/science/article/pii/S0960982220304942

  13. [14]

    Serra E Favila, Brice A Kuhl, and Jonathan Winawer. Spatial perception and memory have distinct activation profiles in human visual cortex. BioRxiv, page 811331, 2019

  14. [15]

    Radoslaw M Cichy, Jakob Heinzle, and John-Dylan Haynes. Imagery and perception share cortical representations of content and location. Cerebral cortex, 22(2):372–380, 2012

  15. [16]

    Anke Marit Albers, Peter Kok, Ivan Toni, H Chris Dijkerman, and Floris P De Lange. Shared representations for working memory and mental imagery in early visual cortex. Current Biology, 23(15):1427–1431, 2013

  16. [17]

    Ghislain St-Yves, Jesse Breedlove, Kendrick Kay, and Thomas Naselaris. Do better models of fmri visual response better predict mental imagery responses? In Conference on Cognitive Computational Neuroscience, 2023. doi: 10.32470/CCN.2023.1644-0. URL https://2023.ccneuro.org/view_paper37c6.html?PaperNum=1644

  17. [18]

    Bertrand Thirion, Edouard Duchesnay, Edward Hubbard, Jessica Dubois, Jean-Baptiste Poline, Denis Lebihan, and Stanislas Dehaene. Inverse retinotopy: Inferring the visual content of images from brain activation patterns. NeuroImage, 33(4):1104–1116, December 2006. ISSN 10538119. doi: 10.1016/j.neuroimage.2006.06.062. URL https://linkinghub.elsevier.com/ret...

  18. [19]

    Mario Senden, Thomas C. Emmerling, Rick van Hoof, Martin A. Frost, and Rainer Goebel. Reconstructing imagined letters from early visual cortex reveals tight topographic correspondence between visual mental imagery and perception. Brain Structure and Function, 224(3):1167–1183, Jan 2019. doi: 10.1007/s00429-019-01828-6

  19. [20]

    Hongmi Lee and Brice A. Kuhl. Reconstructing perceived and retrieved faces from activity patterns in lateral parietal cortex. Journal of Neuroscience, 36(22):6069–6082, 2016. Publisher: Soc Neuroscience

  20. [21]

    Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image reconstruction from human brain activity. PLOS Computational Biology, 15(1):e1006633, January. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1006633. URL https://dx.plos.org/10.1371/journal.pcbi.1006633

  22. [23]

    Naoko Koide-Majima, Shinji Nishimoto, and Kei Majima. Mental image reconstruction from human brain activity: Neural decoding of mental imagery via deep neural network-based bayesian estimation. Neural Networks, 170:349–363, 2024. ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2023.11.024. URL https://www.sciencedirect.com/science/article/pii/S0893...

  23. [24]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machin...

  24. [25]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CoRR, abs/2112.10752, 2021. URL https://arxiv.org/abs/2112.10752

  25. [26]

    Emily J. Allen, Ghislain St-Yves, Yihan Wu, Jesse L. Breedlove, Jacob S. Prince, Logan T. Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, J. Benjamin Hutchinson, Thomas Naselaris, and Kendrick Kay. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1):116–126, January 2022. ...

  26. [27]

    Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023

  27. [28]

    Yu Takagi and Shinji Nishimoto. Improving visual image reconstruction from human brain activity using latent diffusion models via multiple decoded inputs, 2023

  28. [29]

    Furkan Ozcelik and Rufin VanRullen. Natural scene reconstruction from fmri signals using generative latent diffusion. Scientific Reports, 13, 2023. URL https://api.semanticscholar.org/CorpusID:260439960

  29. [30]

    Paul Steven Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Cohen Ethan, Aidan James Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth Norman, and Tanishq Mathew Abraham. Reconstructing the mind’s eye: fMRI-to-image with contrastive learning and diffusion priors. In Thirty-seventh Conference on Neural Information Pro...

  30. [31]

    Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Reconstructing seen images from human brain activity via guided stochastic search. In Conference on Cognitive Computational Neuroscience, 2023. doi: 10.32470/CCN.2023.1672-0. URL https://2023.ccneuro.org/view_paper1337.html?PaperNum=1672

  31. [32]

    Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Second Sight: Using brain-optimized encoding models to align image distributions with human brain activity, June. URL http://arxiv.org/abs/2306.00927. arXiv:2306.00927 [cs, q-bio]

  33. [34]

    Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Brain-optimized inference improves reconstructions of fMRI brain activity, December 2023. URL http://arxiv.org/abs/2312.07705. arXiv:2312.07705 [cs, q-bio]

  34. [35]

    Matteo Ferrante, Tommaso Boccato, Furkan Ozcelik, Rufin VanRullen, and Nicola Toschi. Through their eyes: multi-subject brain decoding with simple alignment techniques. Imaging Neuroscience, 2, 04 2024. doi: 10.1162/imag_a_00170

  35. [36]

    Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22710–22720, 2022. URL https://api.semanticscholar.org/CorpusID:253510456

  36. [37]

    Jingyuan Sun, Mingxiao Li, Zijiao Chen, Yunhao Zhang, Shaonan Wang, and Marie-Francine Moens. Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities, December 2023. URL http://arxiv.org/abs/2305.17214. arXiv:2305.17214 [cs]

  37. [38]

    Weijian Mai and Zhijun Zhang. UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity, August 2023. URL http://arxiv.org/abs/2308.07428. arXiv:2308.07428 [cs]

  38. [39]

    Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=RLRXCV6DbEJ

  39. [40]

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=di52zR8xgf

  40. [41]

    Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7720–7731, 2022. URL https://api.semanticscholar.org/CorpusID:253523371

  41. [42]

    Aria Y. Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, and Leila Wehbe. Incorporating natural language into vision models improves prediction and understanding of higher visual cortex, September 2022. URL https://www.biorxiv.org/content/10.1101/2022.09.27.508760v1. Pages: 2022.09.27.508760 Section: New Results

  42. [43]

    Shizun Wang, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Mindbridge: A cross-subject brain decoding framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11333–11342, 2024

  43. [44]

    Neuropictor: Refining fMRI-to-image reconstruction via multi-individual pretraining and multi-level modulation

    Jingyang Huo, Yikai Wang, Xuelin Qian, Yun Wang, Chong Li, Jianfeng Feng, and Yanwei Fu. Neuropictor: Refining fMRI-to-image reconstruction via multi-individual pretraining and multi-level modulation, 2024

  44. [45]

    Brainram: Cross-modality retrieval-augmented image reconstruction from human brain activity

    Dian Xie, Peiang Zhao, Jiarui Zhang, Kangqi Wei, Xiaobao Ni, and Jiong Xia. Brainram: Cross-modality retrieval-augmented image reconstruction from human brain activity. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, pages 3994–4003, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400706868. doi: 10.11...

  45. [46]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, April 2004. ISSN 1941-0042. doi: 10.1109/TIP.2003.819861. Conference Name: IEEE Transactions on Image Processing

  46. [47]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9...

  47. [48]

    Rethinking the Inception Architecture for Computer Vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567

  48. [49]

    Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. ...

  49. [50]

    Unsupervised learning of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. CoRR, abs/2006.09882, 2020. URL https://arxiv.org/abs/2006.09882

  50. [51]

    A perceptually based comparison of image similarity metrics

    Pawan Sinha and Richard Russell. A perceptually based comparison of image similarity metrics. Perception, 40(11):1269–1281, 2011. doi: 10.1068/p7063. URL https://doi.org/10.1068/p7063. PMID: 22416586

  51. [52]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=G5RwHpBUv0

  52. [53]

    Spurious reconstruction from brain activity: The thin line between reconstruction, classification, and hallucination

    Ken Shirakawa, Yoshihiro Nagano, Misato Tanaka, Shuntaro C. Aoki, Kei Majima, Yusuke Muraki, and Yukiyasu Kamitani. Spurious reconstruction from brain activity: The thin line between reconstruction, classification, and hallucination. Journal of Vision, 2024. URL https://api.semanticscholar.org/CorpusID:269791182

  53. [54]

    Brainbits: How much of the brain are generative reconstruction methods using?

    David Mayo, Christopher Wang, Asa Harbin, Abdulrahman Alabdulkareem, Albert Eaton Shaw, Boris Katz, and Andrei Barbu. Brainbits: How much of the brain are generative reconstruction methods using? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=KAAUvi4kpb

  54. [55]

    Mental imagery in emotion and emotional disorders

    Emily A Holmes and Andrew Mathews. Mental imagery in emotion and emotional disorders. Clinical psychology review, 30(3):349–362, 2010

  55. [56]

    The vegetative and minimally conscious states: A comparison of clinical features and functional outcome

    Joseph T. Giacino and Kathleen Kalmar. The vegetative and minimally conscious states: A comparison of clinical features and functional outcome. Journal of Head Trauma Rehabilitation, 12(4):36–51, 1997. doi: 10.1097/00001199-199708000-00005

  56. [57]

    Early detection of consciousness in patients with acute severe traumatic brain injury

    Brian L Edlow, Camille Chatelle, Camille A. Spencer, Catherine J. Chu, Yelena G. Bodien, Kathryn L. O’Connor, Ronald E. Hirschberg, Leigh R. Hochberg, Joseph T. Giacino, Eric S. Rosenthal, et al. Early detection of consciousness in patients with acute severe traumatic brain injury. Brain, 140(9):2399–2414, 2017. doi: 10.1093/brain/awx176

  57. [58]

    Mortality associated with withdrawal of life-sustaining therapy for patients with severe traumatic brain injury: A Canadian multicentre cohort study

    Alexis F. Turgeon, François Lauzier, Jean-François Simard, Damon C. Scales, Karen E.A. Burns, Lynne Moore, David A. Zygun, Francis Bernard, Maureen O. Meade, Tran Cong Dung, et al. Mortality associated with withdrawal of life-sustaining therapy for patients with severe traumatic brain injury: A Canadian multicentre cohort study. Canadian Medical Associ...

  58. [59]

    Non-invasive systems application in traumatic brain injury rehabilitation

    Livia Livinț Popa, Diana Chira, Ștefan Strilciuc, and Dafin F. Mureșanu. Non-invasive systems application in traumatic brain injury rehabilitation. Brain Sciences, 13(11), 2023. ISSN 2076-3425. doi: 10.3390/brainsci13111594. URL https://www.mdpi.com/2076-3425/13/11/1594

  60. [61]

    Shiyu Luo, Qinwan Rabbani, and Nathan E. Crone. Brain-computer interface: Applications to speech decoding and synthesis to augment communication. Neurotherapeutics, 19(1):263–273, Jan 2022. doi: 10.1007/s13311-022-01190-2

  61. [62]

    Boosting brain–computer interfaces with functional electrical stimulation: Potential applications in people with locked-in syndrome

    Evan Canny, Mariska J. Vansteensel, Sandra M. van der Salm, Gernot R. Müller-Putz, and Julia Berezutskaya. Boosting brain–computer interfaces with functional electrical stimulation: Potential applications in people with locked-in syndrome. Journal of NeuroEngineering and Rehabilitation, 20(1), Nov 2023. doi: 10.1186/s12984-023-01272-y.

  62. [63]

    Ethical considerations for the use of brain–computer interfaces for cognitive enhancement

    Emma C. Gordon and Anil K. Seth. Ethical considerations for the use of brain–computer interfaces for cognitive enhancement. PLOS Biology, 22(10):1–15, 10 2024. doi: 10.1371/journal.pbio.3002899. URL https://doi.org/10.1371/journal.pbio.3002899

  63. [64]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1

  64. [65]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  65. [66]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, June 2024

  66. [67]

    Reproducible scaling laws for contrastive language-image learning

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023

  67. [68]

    Neural networks and the bias/variance dilemma

    Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, Jan 1992. doi: 10.1162/neco.1992.4.1.1

  68. [69]

    Ridge regression: Biased estimation for nonorthogonal problems

    Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970. ISSN 00401706. URL http://www.jstor.org/stable/1267351

  69. [70]

    Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset

    Aria Y. Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, and Leila Wehbe. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nature Machine Intelligence, 5(12):1415–1426, December 2023. ISSN 2522-5839. doi: 10.1038/s42256-023-00753-y. Publisher Copyright: 2023, The Author(s), un...

  70. [71]

    Würstchen: An efficient architecture for large-scale text-to-image diffusion models

    Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gU58d5QeGv

  71. [72]

    GIT: A generative image-to-text transformer for vision and language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=b4tMhpN0JC

  72. [73]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021. URL https://arxiv.org/abs/2108.01073

  73. [74]

    Brain Captioning: Decoding human brain activity into images and text, May 2023

    Matteo Ferrante, Furkan Ozcelik, Tommaso Boccato, Rufin VanRullen, and Nicola Toschi. Brain Captioning: Decoding human brain activity into images and text, May 2023. URL http://arxiv.org/abs/2305.11560. arXiv:2305.11560 [cs]

  74. [75]

    Brain-optimized deep neural network models of human visual areas learn non-hierarchical representations

    Ghislain St-Yves, Emily J. Allen, Yihan Wu, Kendrick Kay, and Thomas Naselaris. Brain-optimized deep neural network models of human visual areas learn non-hierarchical representations. Nature Communications, 14(1):3329, 2023. ISSN 2041-1723. doi: 10.1038/s41467-023-38674-4. URL https://doi.org/10.1038/s41467-023-38674-4.

    Supporting information S1 Tex...

    Normalization of images: We disabled normalization of images when computing VGG19 features. During our initial trials, normalization led to unexpected color distortions in the reconstructed images. Removing normalization allowed the reconstructions to maintain their original color integrity, which is particularly crucial for visual comparisons in tasks req...

    Feature decoding with Ridge Regression: Instead of the fastl2lir library, we employed the Ridge Regression implementation from the sklearn library. This change enhanced compatibility with the rest of our workflow and provided better support for managing memory-intensive computations. For VGG19 layers with a large feature space, feature decoding was perform...
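
    As a rough illustration of ridge-based feature decoding with sklearn, the sketch below fits `sklearn.linear_model.Ridge` on synthetic "voxel" data and predicts a wide feature matrix. Everything beyond the use of sklearn's Ridge is an assumption made for illustration: the helper names `fit_feature_decoders` and `predict_features`, the `alpha` value, the array sizes, and in particular the column-chunking strategy, which is one generic way to bound memory and is not claimed to be the procedure used in the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge


def fit_feature_decoders(voxels, features, alpha=1.0, chunk_size=256):
    """Fit ridge decoders (voxels -> features) one block of feature columns at a time.

    Chunking is a hypothetical memory-saving choice; because ridge treats each
    target column independently, it gives the same predictions as a single fit.
    """
    decoders = []
    for start in range(0, features.shape[1], chunk_size):
        stop = min(start + chunk_size, features.shape[1])
        model = Ridge(alpha=alpha)
        model.fit(voxels, features[:, start:stop])
        decoders.append((start, stop, model))
    return decoders


def predict_features(decoders, voxels, n_features):
    """Assemble the predictions of all chunk-wise decoders into one matrix."""
    out = np.empty((voxels.shape[0], n_features), dtype=np.float64)
    for start, stop, model in decoders:
        out[:, start:stop] = model.predict(voxels)
    return out


# Synthetic stand-in data (sizes are arbitrary, not taken from the paper).
rng = np.random.default_rng(0)
n_train, n_test, n_voxels, n_features = 200, 20, 50, 600
true_weights = rng.normal(size=(n_voxels, n_features))
train_voxels = rng.normal(size=(n_train, n_voxels))
test_voxels = rng.normal(size=(n_test, n_voxels))
train_features = train_voxels @ true_weights + 0.1 * rng.normal(size=(n_train, n_features))
test_features = test_voxels @ true_weights

decoders = fit_feature_decoders(train_voxels, train_features, alpha=1.0, chunk_size=256)
predicted = predict_features(decoders, test_voxels, n_features)
```

    In practice `alpha` would be chosen by cross-validation, and `chunk_size` would be set by available memory rather than fixed in advance.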