MIRAGE: Robust multi-modal architectures translate fMRI-to-image models from vision to mental imagery

Cesar Kadir Torrico Villanueva; Jonathan Xu; Jordyn Ojeda; Paul S. Scotti; Reese Kneeland; Shuhb Khanna; Thomas Naselaris

arXiv: 2605.17198 · v1 · pith:7YJVU6BG · submitted 2026-05-16 · 🧬 q-bio.NC · cs.CV


Reese Kneeland, Cesar Kadir Torrico Villanueva, Jordyn Ojeda, Shuhb Khanna, Jonathan Xu, Paul S. Scotti, Thomas Naselaris

Pith reviewed 2026-05-20 13:42 UTC · model grok-4.3

Classification: 🧬 q-bio.NC, cs.CV

Keywords: mental image reconstruction, fMRI decoding, multi-modal features, diffusion model, NSD-Imagery, vision-to-imagery transfer, linear backbone


The pith

A linear backbone with multi-modal text and image features lets a diffusion model reconstruct mental images from fMRI after training only on external visual stimuli.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that high performance on reconstructing seen images from brain activity does not automatically transfer to internally generated mental images, and that some existing decoders fail on the latter task. It introduces MIRAGE to close this gap: a linear decoder predicts text and image features from fMRI activity, and these features condition a diffusion model. It then demonstrates superior results on the NSD-Imagery benchmark via both quantitative metrics and human ratings. The central finding is that the right architecture makes large external vision datasets usable for mental-image decoding, removing the need for direct mental-imagery training data. This matters because successful cross-decoding would let researchers study visual thought without requiring participants to view external stimuli during every scan.

Core claim

MIRAGE trains on large-scale datasets of external visual stimuli to decode mental images from brain activity. It uses a linear backbone to predict multi-modal text features and both high- and low-level image features from fMRI, and these predictions serve as conditioning input to a diffusion model. On the NSD-Imagery benchmark this yields state-of-the-art reconstructions according to feature-space metrics and human raters. Ablation experiments indicate that performance peaks when image features are kept low-dimensional and when text guidance is included alongside both high- and low-level visual features.

What carries the argument

The MIRAGE linear backbone, which maps fMRI activity to multi-modal text and image features that condition a diffusion model for fMRI-to-image translation.
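A minimal sketch of how such a linear backbone can be set up, assuming one closed-form ridge regression per feature stream. Figure 7 refers to frozen ridge regression models, but the regularization strength, centering, voxel count, feature sizes, and stream names below are illustrative assumptions (the names are hypothetical). Each model maps a trial's voxel pattern to one feature vector; the predicted vectors together form the conditioning input for the diffusion model, which is not shown.

```python
import numpy as np


def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression W = (Xc^T Xc + alpha*I)^-1 Xc^T Yc on centered data.

    X: (n_trials, n_voxels) fMRI responses; Y: (n_trials, n_features) targets.
    The intercept b is recovered from the means, so it is not penalized.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    n_voxels = X.shape[1]
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_voxels), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def predict(X, W, b):
    """Linear readout: one predicted feature vector per row of X."""
    return X @ W + b


rng = np.random.default_rng(0)
n_trials, n_voxels = 200, 50
X = rng.standard_normal((n_trials, n_voxels))  # synthetic "fMRI" responses

# Hypothetical feature streams with toy sizes; real embeddings are far larger.
stream_dims = {"text": 16, "image_high": 8, "image_low": 32}
models = {}
for name, dim in stream_dims.items():
    true_W = rng.standard_normal((n_voxels, dim))
    Y = X @ true_W + 0.1 * rng.standard_normal((n_trials, dim))  # synthetic targets
    models[name] = fit_ridge(X, Y, alpha=1.0)  # one independent model per stream

# Decode a new trial into all streams; the diffusion model itself is not shown.
x_new = rng.standard_normal((1, n_voxels))
conditioning = {name: predict(x_new, W, b) for name, (W, b) in models.items()}
```

Fitting each stream separately keeps every readout linear and cheap to freeze, which matches the "frozen ridge regression models" wording; whether MIRAGE shares one regularization value across streams is not stated here.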

If this is right

Mental-image reconstruction reaches state-of-the-art levels without any direct training on internally generated imagery data.
Low-dimensional image features plus text and both high- and low-level visual cues produce the most accurate mental-image outputs.
Existing large-scale vision datasets become viable training resources for mental-image decoders once the architecture is chosen appropriately.
The gap between seen-image and mental-image decoding performance can be closed by explicit multi-modal conditioning rather than by scaling model size alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the transfer works, brain-computer interfaces could visualize a person's current visual thought without requiring them to look at matching external pictures.
The same multi-modal conditioning strategy might extend to decoding other internal states such as auditory imagery or spatial navigation.
Future tests could check whether the low-dimensional feature preference holds when the diffusion model is replaced by a different generative backbone.
The result implies that mental imagery and external vision share a common low-dimensional representational subspace that can be read out with modest additional guidance.

Load-bearing premise

Brain activity patterns evoked by external visual stimuli are similar enough to those generated during mental imagery that a decoder trained on the former can be applied to the latter.

What would settle it

A controlled experiment in which participants generate mental images while scanned, a model is trained directly on those mental-image fMRI pairs, and that model produces reconstructions rated higher by humans or closer in feature space than MIRAGE outputs on the same test set.
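Such a comparison needs a test of whether one model's per-item scores exceed another's on the same test set. A minimal sketch, assuming paired per-item scores (human ratings or feature-space similarities) and a two-sided sign-flip permutation test; this is one common choice, not a procedure specified by the paper or by this review, and the scores and model names below are synthetic and hypothetical.

```python
import numpy as np


def paired_permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-item scores.

    a, b: scores for the same test items under two methods.
    Returns (observed mean difference a - b, p-value).
    """
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = d.mean()
    rng = np.random.default_rng(seed)
    # Under the null hypothesis each paired difference is equally likely to be
    # positive or negative, so randomly flipping signs gives the null distribution.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (signs * d).mean(axis=1)
    # Adding 1 to numerator and denominator keeps the p-value strictly positive.
    p = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
    return observed, p


# Synthetic per-item scores for two hypothetical models on the same 48 test items.
rng = np.random.default_rng(1)
direct_model = rng.normal(0.60, 0.10, size=48)
mirage_like = direct_model - 0.05 + rng.normal(0.0, 0.05, size=48)
diff, p = paired_permutation_test(direct_model, mirage_like)
```

Pairing by item matters here: both models are scored on the same imagery trials, so item difficulty cancels in the differences.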

Figures

Figures reproduced from arXiv: 2605.17198 by Cesar Kadir Torrico Villanueva, Jonathan Xu, Jordyn Ojeda, Paul S. Scotti, Reese Kneeland, Shuhb Khanna, Thomas Naselaris.

**Figure 1.** MIRAGE (ours) vs. MindEye2 [1] reconstructions of an imagined image from fMRI brain activity.

**Figure 2.** Qualitative comparison of reconstruction

**Figure 3.** (B) Human similarity scores for simple and complex stimuli: X-axis = vision, Y-axis = imagery; each point is the mean over 12 samples (larger bold points are the overall means), colored/shaped by method. PCA-fit slopes closer to unity indicate tighter imagery–vision correspondence; a dashed unity line is shown.
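The PCA-fit slope in Figure 3 can be computed as the direction of greatest variance through the (vision, imagery) points. A minimal sketch, assuming the slope comes from the first principal component of the centered scores; the paper's exact fitting details are not given here, and the example scores are made up.

```python
import numpy as np


def pca_slope(vision, imagery):
    """Slope of the first principal axis through (vision, imagery) points.

    PCA (total least squares) treats both axes symmetrically, unlike an
    ordinary least-squares regression of imagery on vision.
    """
    pts = np.column_stack([vision, imagery]).astype(float)
    pts -= pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts, rowvar=False))
    v = eigvecs[:, np.argmax(eigvals)]  # direction of greatest variance
    return v[1] / v[0]  # rise over run; the sign of v cancels


# Hypothetical per-item scores for one method (made-up numbers).
rng = np.random.default_rng(0)
vision_scores = rng.uniform(0.3, 0.9, size=12)
imagery_scores = 0.7 * vision_scores + rng.normal(0.0, 0.02, size=12)
slope = pca_slope(vision_scores, imagery_scores)  # 1.0 would mean parity
```

A slope of 1.0 means imagery scores track vision scores one-for-one, which is why the figure draws a unity line for reference.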

**Figure 4.** (A) Head-to-head human similarity score results for the conceptual stimuli. The Y-axis represents the similarity score advantage (the difference between the target method's score and that of the alternative shown on the radial X-axis); a larger colored polygon area indicates a stronger advantage, and the dashed circle at unity denotes equal performance. MIRAGE outperforms all other methods (p < 0.001). (B) Ablation analyses: m…
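The head-to-head advantage in Figure 4 can be tallied per alternative method from pairwise ratings. A minimal sketch, assuming each rating records a score for both methods in the pair; the record format and the "Baseline" label are hypothetical. The sketch uses plain differences, so equal performance is 0, whereas the figure marks equal performance with a circle at unity, so its normalization (not specified here) likely differs.

```python
from collections import defaultdict

import numpy as np


def head_to_head_advantage(trials, target):
    """Mean score advantage of `target` over each method it was paired against.

    trials: iterable of (method_a, method_b, score_a, score_b) records, one per
    head-to-head rating. Positive values mean `target` scored higher on average.
    """
    diffs = defaultdict(list)
    for method_a, method_b, score_a, score_b in trials:
        if method_a == target:
            diffs[method_b].append(score_a - score_b)
        elif method_b == target:
            diffs[method_a].append(score_b - score_a)
    return {alt: float(np.mean(d)) for alt, d in diffs.items()}


# Made-up ratings; "Baseline" is a hypothetical method label.
trials = [
    ("MIRAGE", "MindEye2", 0.7, 0.5),
    ("MindEye2", "MIRAGE", 0.4, 0.6),
    ("MIRAGE", "Baseline", 0.6, 0.6),
]
advantage = head_to_head_advantage(trials, target="MIRAGE")
```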

**Figure 5.** Overview of the tasks utilized for the NSD-Imagery benchmark.

**Figure 6.** MIRAGE training pipeline. (1) Brain activity (7T fMRI) is acquired as NSD subjects view > 10K stimuli. (2) Stimuli are passed to the VDVAE encoder [37], yielding (1 × 91168) latents. (3) LLaVA v1.5-13B [62, 63] generates synthetic captions. (4) Captions are encoded into CLIP ViT-bigG/14 text embeddings (77 × 1280) [64]. (5) Stimuli are also passed through the CLIP ViT-L/14 image encoder [23] to generate both CLS to…

**Figure 7.** MIRAGE inference pipeline. (1) The NSD subjects imagine stimuli from letter cues under 7T fMRI. (2) A set of feature embeddings is predicted by passing the measured fMRI brain activity through our frozen ridge regression models. (3) The VDVAE [37] latents are reconstructed into a low-level image. (4) The image is filtered to boost its structure. (5) The filtered low-level image, decoded image embedding, an…
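Step (2) of the inference pipeline applies frozen linear readouts to a measured fMRI pattern and must return embeddings in the shapes the downstream models expect. A minimal sketch, assuming each feature stream has its own flattened linear map whose output is reshaped to the shapes listed in the Figure 6 caption; the weights are random placeholders rather than trained models, and the dictionary keys are hypothetical. Steps (3) to (5) (VDVAE decoding, filtering, and diffusion) are not shown.

```python
import numpy as np

# Embedding shapes taken from the Figure 6 caption.
FEATURE_SHAPES = {
    "vdvae_latents": (1, 91168),
    "clip_text": (77, 1280),
}


def decode_features(x, frozen):
    """Apply frozen linear (ridge) readouts to one fMRI pattern.

    x: (n_voxels,) response vector. frozen: name -> (W, b), where W has shape
    (n_voxels, prod(shape)) and b has shape (prod(shape),). Each flat prediction
    is reshaped back to its embedding's native shape.
    """
    out = {}
    for name, (W, b) in frozen.items():
        flat = x @ W + b
        out[name] = flat.reshape(FEATURE_SHAPES[name])
    return out


rng = np.random.default_rng(0)
n_voxels = 4  # toy size so the example stays small
# Random placeholder weights standing in for trained, frozen ridge models.
frozen = {}
for name, shape in FEATURE_SHAPES.items():
    size = int(np.prod(shape))
    W = rng.standard_normal((n_voxels, size), dtype=np.float32)
    frozen[name] = (W, np.zeros(size, dtype=np.float32))

x = rng.standard_normal(n_voxels, dtype=np.float32)
features = decode_features(x, frozen)
```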

Original abstract

To be useful for downstream applications, vision decoding models that are trained to reconstruct seen images from human brain activity must be able to generalize to internally generated visual representations, i.e., mental images. In an analysis of the recently released NSD-Imagery dataset, we demonstrated that while some modern vision decoders can perform quite well on mental image reconstruction, some fail, and that state-of-the-art (SOTA) performance on seen image reconstruction is no guarantee of SOTA performance on mental image reconstruction. Motivated by these findings, we developed MIRAGE, a method explicitly designed to train on vision datasets and cross-decode mental images from brain activity. MIRAGE employs a linear backbone and multi-modal text and image features as input to a diffusion model. Feature metrics and human raters establish MIRAGE as SOTA for mental image reconstruction on the NSD-Imagery benchmark. With ablation analysis we show that mental image reconstruction works best when decoders use image features with relatively few dimensions and include guidance from text-based and both high- and low-level image-based features. Our work indicates that, given the right architecture, existing large-scale datasets using external stimuli are viable training data for decoding mental images, and warrant optimism about the future success and utility of mental image reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

MIRAGE shows that a linear multi-modal decoder trained on vision data can achieve SOTA mental-image reconstruction on NSD-Imagery, but the cued-recall setup leaves open whether it truly generalizes to purely internal generation.


The main takeaway is that MIRAGE uses a linear backbone to predict low-dimensional image features plus text and high- and low-level visual guidance, feeds these into a diffusion model, and beats prior approaches on the NSD-Imagery mental-image reconstruction benchmark. The paper first shows that strong seen-image decoders do not automatically transfer to mental images, which is a useful observation, and then builds a targeted fix whose ablations point to fewer feature dimensions and mixed text/image guidance as helpful choices. That combination, together with the explicit comparison of seen and imagined conditions, is the concrete new contribution. The work builds on existing diffusion and multi-modal tools but applies them with a clear focus on the cross-domain gap, and the use of both feature metrics and human raters gives the results an external check.

The soft spot is that the abstract and available details give no actual numbers, error bars, or statistical tests, so it is hard to judge the size of the improvement or how robust the SOTA claim is. More importantly, NSD-Imagery mental trials are cued by previously viewed scenes, so the brain activity may still carry recall signals that overlap with the external-stimulus training distribution rather than testing fully arbitrary internal imagery. That does not invalidate the results on this benchmark, but it does mean that the claim that large vision datasets are viable for mental decoding rests on a narrower test than the abstract suggests.

The paper is aimed at people working on fMRI decoding and on brain-computer interfaces that need to handle internal states. Anyone already following NSD or mental-image reconstruction will find the architecture and ablation choices worth examining. It is coherent on its own terms and engages the literature directly, so it deserves a serious referee, even if the quantitative gaps and the cued-recall issue need to be addressed in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces MIRAGE, a linear multi-modal architecture that trains on external vision datasets and uses text plus image features (high- and low-level) as input to a diffusion model for reconstructing mental images from fMRI. On the NSD-Imagery benchmark it reports state-of-the-art performance via feature metrics and human ratings, with ablations showing best results when image features are low-dimensional and guidance from text and both high- and low-level image features is included. The central conclusion is that large-scale seen-image datasets are viable training data for mental-image decoding.

Significance. If the reported generalization holds, the result would be significant for brain decoding: it provides an explicit architecture and training recipe that bridges seen-image and mental-image regimes, offers concrete ablation evidence on which feature types matter, and uses a relatively simple linear backbone that may aid interpretability. The work also makes a falsifiable prediction: performance on cued mental imagery should transfer when the same linear mapping is applied to novel, non-cued internal content.

major comments (2)

[Abstract and §4 (Evaluation)] The claim that 'existing large-scale datasets using external stimuli are viable training data for decoding mental images' is load-bearing for the paper's main contribution, yet the NSD-Imagery mental-imagery trials are cued by previously viewed natural scenes. This leaves open the possibility that brain activity contains recall components aligned with the training distribution rather than arbitrary internally generated content; the reported ablations do not isolate this factor.
[Results] The assertion of SOTA performance is supported only by the statement that 'feature metrics and human raters establish MIRAGE as SOTA'; no numerical values, baseline scores, error bars, or statistical tests appear in the abstract or summary text, preventing direct verification of the performance gap.
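One way to supply the requested error bars is a bootstrap over test items. A minimal sketch, assuming per-item scores for a single method are available; the percentile bootstrap is an illustrative choice, not the paper's stated procedure, and the scores are synthetic.

```python
import numpy as np


def bootstrap_ci(scores, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-item scores.

    Returns (sample mean, lower bound, upper bound).
    """
    x = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample items with replacement, n_boot times.
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    boot_means = x[idx].mean(axis=1)
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(boot_means, [tail, 100 - tail])
    return x.mean(), lo, hi


# Synthetic per-item scores for a single hypothetical method.
rng = np.random.default_rng(2)
scores = rng.normal(0.55, 0.10, size=48)
mean, lo, hi = bootstrap_ci(scores)
```

Resampling items (rather than raters or subjects) treats the test images as the unit of generalization; a nested bootstrap would be needed to account for several sources of variance at once.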

minor comments (2)

[Methods] The exact procedure for obtaining and reducing the dimensionality of the image features (one of the free parameters listed in the axiom ledger) should be stated explicitly, including the source embedding model and any learned projection.
[Figure 1 and §3] In the figure captions and §3, the multi-modal input diagram should label each feature stream (text, high-level image, low-level image) and indicate whether the linear backbone is trained exclusively on vision data before zero-shot application to mental-imagery fMRI.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. These help us clarify the scope of our generalization claims and strengthen the presentation of quantitative results. We respond to each major comment below.


Referee: [Abstract and §4 (Evaluation)] The claim that 'existing large-scale datasets using external stimuli are viable training data for decoding mental images' is load-bearing for the paper's main contribution, yet the NSD-Imagery mental-imagery trials are cued by previously viewed natural scenes. This leaves open the possibility that brain activity contains recall components aligned with the training distribution rather than arbitrary internally generated content; the reported ablations do not isolate this factor.

Authors: We agree that the NSD-Imagery mental-imagery trials are cued by previously viewed scenes and therefore may engage recall processes in addition to internally generated content. Our ablations examine feature-type contributions rather than isolating recall versus pure generation. Nevertheless, the central result remains that a linear multi-modal model trained exclusively on external vision data successfully decodes these mental images, supporting the viability of large-scale seen-image datasets for mental-image reconstruction on this benchmark. We will revise the abstract and add a paragraph in the Discussion to explicitly note the cued nature of the imagery, distinguish it from uncued internal content, and frame this as a limitation for future work. Revision: partial.
Referee: [Results] The assertion of SOTA performance is supported only by the statement that 'feature metrics and human raters establish MIRAGE as SOTA'; no numerical values, baseline scores, error bars, or statistical tests appear in the abstract or summary text, preventing direct verification of the performance gap.

Authors: We agree that the abstract and high-level summary would be strengthened by including concrete numerical comparisons. The full Results section already reports detailed feature-metric values, baseline scores, human ratings, and statistical comparisons in tables and figures. We will revise the abstract to include representative quantitative results (e.g., key metric improvements and human preference rates) with pointers to the supporting tables, enabling immediate verification of the SOTA claim. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; derivation is empirically grounded


The paper trains MIRAGE (linear backbone + multi-modal features into diffusion model) on external-stimulus vision datasets and reports SOTA metrics on the held-out NSD-Imagery mental-imagery benchmark. The central claim—that such training data are viable for mental-image decoding—rests on cross-domain empirical performance rather than any self-definitional mapping, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or sections in the provided text reduce the generalization result to a quantity defined by the model’s own fitted values. The architecture choices and ablation results (low-dimensional image features plus text/high-low guidance) are presented as independent design decisions whose success is measured externally. This is the most common honest non-finding for a methods paper whose test set is distinct from its training distribution.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that pre-trained vision and language models produce features that can be linearly mapped from fMRI to guide diffusion-based generation of mental images, plus the domain assumption that the NSD-Imagery dataset faithfully represents mental imagery.

free parameters (1)

dimensionality of image features
Ablation analysis identifies relatively few dimensions as optimal, implying this hyperparameter is selected to fit performance on the benchmark.

axioms (1)

Domain assumption: Large-scale external-stimulus fMRI datasets can serve as effective training data for mental imagery decoding.
Explicitly stated as a conclusion in the abstract and required for the claim that existing datasets are viable.

pith-pipeline@v0.9.0 · 5788 in / 1291 out tokens · 62897 ms · 2026-05-20T13:42:20.293337+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "MIRAGE employs a linear backbone and multi-modal text and image features as input to a diffusion model... ablation analysis we show that mental image reconstruction works best when decoders use image features with relatively few dimensions and include guidance from text-based and both high- and low-level image-based features."

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · relation: unclear

Paper passage: "We introduce MIRAGE... linear decoding backbones with low-dimensional multi-modal feature spaces"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 3 internal anchors

[1] Paul S. Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, and Tanishq Mathew Abraham. MindEye2: Shared-subject models enable fMRI-to-image with 1 hour of data. In Proceedings of the 41st International Conference on Machine Learning, 2024.

[2] Reese Kneeland, Paul S. Scotti, Ghislain St-Yves, Jesse Breedlove, Kendrick Kay, and Thomas Naselaris. NSD-Imagery: A benchmark dataset for extending fMRI vision decoding methods to mental imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28852–28862, June 2025.

[4] Stephen M Kosslyn, William L Thompson, and Giorgio Ganis. The case for mental imagery. Oxford University Press, 2006.

[5] Mark Stokes, Russell Thompson, Rhodri Cusack, and John Duncan. Top-Down Activation of Shape-Specific Population Codes in Visual Cortex during Mental Imagery. Journal of Neuroscience, 29(5):1565–1572, February 2009. ISSN 0270-6474, 1529-2401. doi: 10.1523/JNEUROSCI.4657-08.2009. URL https://www.jneurosci.org/content/29/5/1565. Publisher: Society for Neuros...

[6] Rainer Goebel, Rick van Hoof, Salil Bhat, Michael Lührs, and Mario Senden. Reading Imagined Letter Shapes from the Mind's Eye Using Real-time 7 Tesla fMRI. In 2022 10th International Winter Conference on Brain-Computer Interface (BCI), pages 1–3, February 2022. doi: 10.1109/BCI53720.2022.9735031. ISSN: 2572-7672.

[7] Thomas Naselaris, Cheryl A. Olman, Dustin E. Stansbury, Kamil Ugurbil, and Jack L. Gallant. A voxel-wise encoding model for early visual areas decodes mental images of remembered scenes. NeuroImage, 105:215–228, January 2015. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2014.10.018. URL https://www.sciencedirect.com/science/article/pii/S1053811914008428

[8] Leila Reddy, Naotsugu Tsuchiya, and Thomas Serre. Reading the mind's eye: Decoding category information during mental imagery. NeuroImage, 50(2):818–825, April 2010. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2009.11.084. URL https://www.sciencedirect.com/science/article/pii/S1053811909012701

[9] Sue-Hyun Lee, Dwight J Kravitz, and Chris I Baker. Disentangling visual imagery and perception of real-world objects. NeuroImage, 59(4):4064–4073, 2012.

[10] Joel Pearson. The human imagination: the cognitive neuroscience of visual mental imagery. Nature Reviews Neuroscience, 20(10):624–634, 2019.

[11] Tiasha Saha Roy, Jesse Breedlove, Ghislain St-Yves, Kendrick Kay, and Thomas Naselaris. Comparison of signal to noise in vision and imagery for qualitatively different kinds of stimuli. Journal of Vision, 23(9):5961, 2023. ISSN 1534-7362. doi: 10.1167/jov.23.9.5961. URL https://doi.org/10.1167/jov.23.9.5961

[12] Tiasha Saha Roy, Jesse Breedlove, Ghislain St-Yves, Kendrick Kay, and Thomas Naselaris. Mental imagery: Weak vision or compressed vision? In Conference on Cognitive Computational Neuroscience, 2023. doi: 10.32470/CCN.2023.1693-0. URL https://2023.ccneuro.org/view_paper4eea.html?PaperNum=1693

[13] Jesse L. Breedlove, Ghislain St-Yves, Cheryl A. Olman, and Thomas Naselaris. Generative feedback explains distinct brain activity codes for seen and mental images. Current Biology, 30(12):2211–2224.e6, 2020. ISSN 0960-9822. doi: 10.1016/j.cub.2020.04.014. URL https://www.sciencedirect.com/science/article/pii/S0960982220304942

[14] Serra E Favila, Brice A Kuhl, and Jonathan Winawer. Spatial perception and memory have distinct activation profiles in human visual cortex. BioRxiv, page 811331, 2019.

[15] Radoslaw M Cichy, Jakob Heinzle, and John-Dylan Haynes. Imagery and perception share cortical representations of content and location. Cerebral Cortex, 22(2):372–380, 2012.

[16] Anke Marit Albers, Peter Kok, Ivan Toni, H Chris Dijkerman, and Floris P De Lange. Shared representations for working memory and mental imagery in early visual cortex. Current Biology, 23(15):1427–1431, 2013.

[17] Ghislain St-Yves, Jesse Breedlove, Kendrick Kay, and Thomas Naselaris. Do better models of fMRI visual response better predict mental imagery responses? In Conference on Cognitive Computational Neuroscience, 2023. doi: 10.32470/CCN.2023.1644-0. URL https://2023.ccneuro.org/view_paper37c6.html?PaperNum=1644

[18] Bertrand Thirion, Edouard Duchesnay, Edward Hubbard, Jessica Dubois, Jean-Baptiste Poline, Denis Lebihan, and Stanislas Dehaene. Inverse retinotopy: Inferring the visual content of images from brain activation patterns. NeuroImage, 33(4):1104–1116, December 2006. ISSN 10538119. doi: 10.1016/j.neuroimage.2006.06.062. URL https://linkinghub.elsevier.com/ret...

[19] Mario Senden, Thomas C. Emmerling, Rick van Hoof, Martin A. Frost, and Rainer Goebel. Reconstructing imagined letters from early visual cortex reveals tight topographic correspondence between visual mental imagery and perception. Brain Structure and Function, 224(3):1167–1183, Jan 2019. doi: 10.1007/s00429-019-01828-6

[20] Hongmi Lee and Brice A. Kuhl. Reconstructing perceived and retrieved faces from activity patterns in lateral parietal cortex. Journal of Neuroscience, 36(22):6069–6082, 2016. Publisher: Soc Neuroscience

[21] Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image reconstruction from human brain activity. PLOS Computational Biology, 15(1):e1006633, January 2019. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1006633. URL https://dx.plos.org/10.1371/journal.pcbi.1006633

[23] Naoko Koide-Majima, Shinji Nishimoto, and Kei Majima. Mental image reconstruction from human brain activity: Neural decoding of mental imagery via deep neural network-based Bayesian estimation. Neural Networks, 170:349–363, 2024. ISSN 0893-6080. doi: 10.1016/j.neunet.2023.11.024. URL https://www.sciencedirect.com/science/article/pii/S0893...

[24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machin...

[25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CoRR, abs/2112.10752, 2021. URL https://arxiv.org/abs/2112.10752

[26] Emily J. Allen, Ghislain St-Yves, Yihan Wu, Jesse L. Breedlove, Jacob S. Prince, Logan T. Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, J. Benjamin Hutchinson, Thomas Naselaris, and Kendrick Kay. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1):116–126, January 2022. ... doi: 10.1038/s41593-021-00962-x

[27] Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023.

[28] Yu Takagi and Shinji Nishimoto. Improving visual image reconstruction from human brain activity using latent diffusion models via multiple decoded inputs, 2023.

[29] Furkan Ozcelik and Rufin VanRullen. Natural scene reconstruction from fMRI signals using generative latent diffusion. Scientific Reports, 13, 2023. URL https://api.semanticscholar.org/CorpusID:260439960

[30] Paul Steven Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Cohen Ethan, Aidan James Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth Norman, and Tanishq Mathew Abraham. Reconstructing the mind's eye: fMRI-to-image with contrastive learning and diffusion priors. In Thirty-seventh Conference on Neural Information Pro...

[31] Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Reconstructing seen images from human brain activity via guided stochastic search. In Conference on Cognitive Computational Neuroscience, 2023. doi: 10.32470/CCN.2023.1672-0. URL https://2023.ccneuro.org/view_paper1337.html?PaperNum=1672

[32] Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Second Sight: Using brain-optimized encoding models to align image distributions with human brain activity, June 2023. URL http://arxiv.org/abs/2306.00927. arXiv:2306.00927 [cs, q-bio]

[34] Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Brain-optimized inference improves reconstructions of fMRI brain activity, December 2023. URL http://arxiv.org/abs/2312.07705. arXiv:2312.07705 [cs, q-bio]

[35] Matteo Ferrante, Tommaso Boccato, Furkan Ozcelik, Rufin VanRullen, and Nicola Toschi. Through their eyes: multi-subject brain decoding with simple alignment techniques. Imaging Neuroscience, 2, 04 2024. doi: 10.1162/imag_a_00170

[36] Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22710–22720, 2022. URL https://api.semanticscholar.org/CorpusID:253510456

[37] Jingyuan Sun, Mingxiao Li, Zijiao Chen, Yunhao Zhang, Shaonan Wang, and Marie-Francine Moens. Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities, December 2023. URL http://arxiv.org/abs/2305.17214. arXiv:2305.17214 [cs]

[38] Weijian Mai and Zhijun Zhang. UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity, August 2023. URL http://arxiv.org/abs/2308.07428. arXiv:2308.07428 [cs]

[39] Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=RLRXCV6DbEJ

[40] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=di52zR8xgf

[41] Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7720–7731, 2022. URL https://api.semanticscholar.org/CorpusID:253523371

[42] Aria Y. Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, and Leila Wehbe. Incorporating natural language into vision models improves prediction and understanding of higher visual cortex, September 2022. URL https://www.biorxiv.org/content/10.1101/2022.09.27.508760v1. Pages: 2022.09.27.508760 Section: New Results

[43] Shizun Wang, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Mindbridge: A cross-subject brain decoding framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11333–11342, 2024

[44] Jingyang Huo, Yikai Wang, Xuelin Qian, Yun Wang, Chong Li, Jianfeng Feng, and Yanwei Fu. Neuropictor: Refining fmri-to-image reconstruction via multi-individual pretraining and multi-level modulation, 2024

[45] Dian Xie, Peiang Zhao, Jiarui Zhang, Kangqi Wei, Xiaobao Ni, and Jiong Xia. Brainram: Cross-modality retrieval-augmented image reconstruction from human brain activity. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, page 3994–4003, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400706868. doi: 10.1145/3664647.3681296

[46] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, April 2004. ISSN 1941-0042. doi: 10.1109/TIP.2003.819861. Conference Name: IEEE Transactions on Image Processing

[47] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9...

[48] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567

[49] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. ...

[50] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. CoRR, abs/2006.09882, 2020. URL https://arxiv.org/abs/2006.09882

[51] Pawan Sinha and Richard Russell. A perceptually based comparison of image similarity metrics. Perception, 40(11):1269–1281, 2011. doi: 10.1068/p7063. URL https://doi.org/10.1068/p7063. PMID: 22416586

[52] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=G5RwHpBUv0

[53] Ken Shirakawa, Yoshihiro Nagano, Misato Tanaka, Shuntaro C. Aoki, Kei Majima, Yusuke Muraki, and Yukiyasu Kamitani. Spurious reconstruction from brain activity: The thin line between reconstruction, classification, and hallucination. Journal of Vision, 2024. URL https://api.semanticscholar.org/CorpusID:269791182

[54] David Mayo, Christopher Wang, Asa Harbin, Abdulrahman Alabdulkareem, Albert Eaton Shaw, Boris Katz, and Andrei Barbu. Brainbits: How much of the brain are generative reconstruction methods using? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=KAAUvi4kpb

[55] Emily A Holmes and Andrew Mathews. Mental imagery in emotion and emotional disorders. Clinical psychology review, 30(3):349–362, 2010

[56] Joseph T. Giacino and Kathleen Kalmar. The vegetative and minimally conscious states: A comparison of clinical features and functional outcome. Journal of Head Trauma Rehabilitation, 12(4):36–51, 1997. doi: 10.1097/00001199-199708000-00005

[57] Brian L Edlow, Camille Chatelle, Camille A. Spencer, Catherine J. Chu, Yelena G. Bodien, Kathryn L. O’Connor, Ronald E. Hirschberg, Leigh R. Hochberg, Joseph T. Giacino, Eric S. Rosenthal, and et al. Early detection of consciousness in patients with acute severe traumatic brain injury. Brain, 140(9):2399–2414, 2017. doi: 10.1093/brain/awx176

[58] Alexis F. Turgeon, François Lauzier, Jean-François Simard, Damon C. Scales, Karen E.A. Burns, Lynne Moore, David A. Zygun, Francis Bernard, Maureen O. Meade, Tran Cong Dung, and et al. Mortality associated with withdrawal of life-sustaining therapy for patients with severe traumatic brain injury: A canadian multicentre cohort study. Canadian Medical Associ... doi: 10.1503/cmaj.101786

[59] Livia Livinț Popa, Diana Chira, Ștefan Strilciuc, and Dafin F. Mureșanu. Non-invasive systems application in traumatic brain injury rehabilitation. Brain Sciences, 13(11), 2023. ISSN 2076-3425. doi: 10.3390/brainsci13111594. URL https://www.mdpi.com/2076-3425/13/11/1594

[61] Shiyu Luo, Qinwan Rabbani, and Nathan E. Crone. Brain-computer interface: Applications to speech decoding and synthesis to augment communication. Neurotherapeutics, 19(1):263–273, Jan 2022. doi: 10.1007/s13311-022-01190-2

[62] Evan Canny, Mariska J. Vansteensel, Sandra M. van der Salm, Gernot R. Müller-Putz, and Julia Berezutskaya. Boosting brain–computer interfaces with functional electrical stimulation: Potential applications in people with locked-in syndrome. Journal of NeuroEngineering and Rehabilitation, 20(1), Nov 2023. doi: 10.1186/s12984-023-01272-y

[63] Emma C. Gordon and Anil K. Seth. Ethical considerations for the use of brain–computer interfaces for cognitive enhancement. PLOS Biology, 22(10):1–15, 10 2024. doi: 10.1371/journal.pbio.3002899. URL https://doi.org/10.1371/journal.pbio.3002899

[64] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1

[65] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

[66] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, June 2024

[67] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023

[68] Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, Jan 1992. doi: 10.1162/neco.1992.4.1.1

[69] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970. ISSN 00401706. URL http://www.jstor.org/stable/1267351

[70] Aria Y. Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, and Leila Wehbe. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nature Machine Intelligence, 5(12):1415–1426, December 2023. ISSN 2522-5839. doi: 10.1038/s42256-023-00753-y. Publisher Copyright: 2023, The Author(s), un...

[71] Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gU58d5QeGv

[72] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=b4tMhpN0JC

[73] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021. URL https://arxiv.org/abs/2108.01073

[74] Matteo Ferrante, Furkan Ozcelik, Tommaso Boccato, Rufin VanRullen, and Nicola Toschi. Brain Captioning: Decoding human brain activity into images and text, May 2023. URL http://arxiv.org/abs/2305.11560. arXiv:2305.11560 [cs]

[75] Ghislain St-Yves, Emily J. Allen, Yihan Wu, Kendrick Kay, and Thomas Naselaris. Brain-optimized deep neural network models of human visual areas learn non-hierarchical representations. Nature Communications, 14(1):3329, 2023. ISSN 2041-1723. doi: 10.1038/s41467-023-38674-4. URL https://doi.org/10.1038/s41467-023-38674-4

Supporting information S1 Tex...

Normalization of images: We disabled normalization of images when computing VGG19 features. During our initial trials, normalization led to unexpected color distortions in the reconstructed images. Removing normalization allowed the reconstructions to maintain their original color integrity, which is particularly crucial for visual comparisons in tasks req...

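The following sketch makes the normalization setting concrete. It shows what toggling input normalization means for a VGG19-style preprocessing step. It assumes the torchvision convention of per-channel ImageNet mean and standard deviation. The constants `IMAGENET_MEAN` and `IMAGENET_STD` and the helper `prepare_vgg_input` are illustrative; they do not come from the authors' code. The sketch uses NumPy only, so it runs without model weights.

It also shows one plausible but unverified route to color shifts. After normalization, a neutral gray pixel no longer has equal values across the three channels. Any later stage that treats normalized values as RGB without inverting the transform would therefore see a tint.

```python
import numpy as np

# Per-channel ImageNet statistics used by torchvision-style VGG19 preprocessing.
# Assumption: these standard values are used here for illustration only.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def prepare_vgg_input(image_uint8: np.ndarray, normalize: bool = False) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB image into a (3, H, W) float32 array.

    normalize=False: only rescale pixel values to [0, 1] (colors untouched).
    normalize=True: additionally subtract the ImageNet mean and divide by the
    ImageNet standard deviation, channel by channel.
    """
    if image_uint8.ndim != 3 or image_uint8.shape[-1] != 3:
        raise ValueError("expected an (H, W, 3) RGB image")
    x = image_uint8.astype(np.float32) / 255.0
    if normalize:
        # Broadcasting applies one mean/std pair per channel (last axis).
        x = (x - IMAGENET_MEAN) / IMAGENET_STD
    # HWC -> CHW, the layout expected by PyTorch-style convolutional models.
    return np.transpose(x, (2, 0, 1))


if __name__ == "__main__":
    gray = np.full((4, 4, 3), 128, dtype=np.uint8)  # neutral mid-gray test image
    raw = prepare_vgg_input(gray)
    norm = prepare_vgg_input(gray, normalize=True)
    print("unnormalized channel values:", raw[:, 0, 0])
    print("normalized channel values:  ", norm[:, 0, 0])
```

With `normalize=False`, a mid-gray image stays at about 0.502 in every channel. With `normalize=True`, the R, G, and B channels become roughly 0.07, 0.21, and 0.43, which is the channel asymmetry described above.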

Feature decoding with Ridge Regression: Instead of the fastl2lir library, we employed the Ridge Regression implementation from the sklearn library. This change enhanced compatibility with the rest of our workflow and provided better support for managing memory-intensive computations. For VGG19 layers with a large feature space, feature decoding was perform...

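The sketch below shows ridge-based feature decoding with scikit-learn's `Ridge`. It maps voxel responses to a wide target feature matrix, such as a flattened VGG19 layer. The source text is truncated before it explains how large layers were handled, so the column-chunking strategy here is an assumption for illustration. The names `decode_features_in_chunks`, `alpha`, and `chunk_size` are hypothetical.

Chunking is exact for ridge regression with a single shared penalty. Each target column is fit independently given the same design matrix, so fitting blocks of columns separately gives the same predictions as one multi-output fit. Meanwhile, the fitted model holds only an `n_voxels × chunk_size` coefficient block at a time.

```python
import numpy as np
from sklearn.linear_model import Ridge


def decode_features_in_chunks(X_train, Y_train, X_test, alpha=1.0, chunk_size=1024):
    """Ridge-decode a wide target matrix one block of columns at a time.

    X_train: (n_train, n_voxels) fMRI responses used for fitting
    Y_train: (n_train, n_features) target features, e.g. a flattened VGG19 layer
    X_test:  (n_test, n_voxels) fMRI responses to decode
    Returns an (n_test, n_features) array of predicted features.
    """
    n_test, n_features = X_test.shape[0], Y_train.shape[1]
    Y_pred = np.empty((n_test, n_features), dtype=np.float64)
    for start in range(0, n_features, chunk_size):
        stop = min(start + chunk_size, n_features)
        # One shared penalty for every column, so each block is an
        # independent sub-problem of the full multi-output fit.
        model = Ridge(alpha=alpha)
        model.fit(X_train, Y_train[:, start:stop])
        # Reshape keeps a 2-D result even for a single-column block.
        Y_pred[:, start:stop] = model.predict(X_test).reshape(n_test, -1)
    return Y_pred


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.standard_normal((200, 50))   # 200 synthetic trials x 50 voxels
    W_true = rng.standard_normal((50, 300))    # 300 synthetic target features
    Y_train = X_train @ W_true + 0.1 * rng.standard_normal((200, 300))
    X_test = rng.standard_normal((20, 50))

    Y_chunked = decode_features_in_chunks(X_train, Y_train, X_test, alpha=10.0, chunk_size=64)
    Y_full = Ridge(alpha=10.0).fit(X_train, Y_train).predict(X_test)
    print("max abs difference vs. single fit:", np.abs(Y_chunked - Y_full).max())
```

The chunked and single-fit predictions should agree to floating-point precision. The trade-off is that the Gram matrix of `X_train` is recomputed for every chunk. A larger `chunk_size` reduces that overhead at the cost of memory.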

[1] Paul S. Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, and Tanishq Mathew Abraham. Mindeye2: shared-subject models enable fmri-to-image with 1 hour of data. In Proceedings of the 41st International Conference on Machine Learning, 2024

[2] Reese Kneeland, Paul S. Scotti, Ghislain St-Yves, Jesse Breedlove, Kendrick Kay, and Thomas Naselaris. Nsd-imagery: A benchmark dataset for extending fmri vision decoding methods to mental imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28852–28862, June 2025

[4] Stephen M Kosslyn, William L Thompson, and Giorgio Ganis. The case for mental imagery. Oxford University Press, 2006

[5] Mark Stokes, Russell Thompson, Rhodri Cusack, and John Duncan. Top-Down Activation of Shape-Specific Population Codes in Visual Cortex during Mental Imagery. Journal of Neuroscience, 29(5):1565–1572, February 2009. ISSN 0270-6474, 1529-2401. doi: 10.1523/JNEUROSCI.4657-08.2009. URL https://www.jneurosci.org/content/29/5/1565. Publisher: Society for Neuros...

[6] Rainer Goebel, Rick van Hoof, Salil Bhat, Michael Lührs, and Mario Senden. Reading Imagined Letter Shapes from the Mind’s Eye Using Real-time 7 Tesla fMRI. In 2022 10th International Winter Conference on Brain-Computer Interface (BCI), pages 1–3, February 2022. doi: 10.1109/BCI53720.2022.9735031. ISSN: 2572-7672

[7] Thomas Naselaris, Cheryl A. Olman, Dustin E. Stansbury, Kamil Ugurbil, and Jack L. Gallant. A voxel-wise encoding model for early visual areas decodes mental images of remembered scenes. NeuroImage, 105:215–228, January 2015. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2014.10.018. URL https://www.sciencedirect.com/science/article/pii/S1053811914008428

[8] Leila Reddy, Naotsugu Tsuchiya, and Thomas Serre. Reading the mind’s eye: Decoding category information during mental imagery. NeuroImage, 50(2):818–825, April 2010. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2009.11.084. URL https://www.sciencedirect.com/science/article/pii/S1053811909012701

[9] Sue-Hyun Lee, Dwight J Kravitz, and Chris I Baker. Disentangling visual imagery and perception of real-world objects. Neuroimage, 59(4):4064–4073, 2012

[10] Joel Pearson. The human imagination: the cognitive neuroscience of visual mental imagery. Nature reviews neuroscience, 20(10):624–634, 2019

[11] Tiasha Saha Roy, Jesse Breedlove, Ghislain St-Yves, Kendrick Kay, and Thomas Naselaris. Comparison of signal to noise in vision and imagery for qualitatively different kinds of stimuli. Journal of Vision, 23(9):5961, 2023. ISSN 1534-7362. doi: 10.1167/jov.23.9.5961. URL https://doi.org/10.1167/jov.23.9.5961

[12] Tiasha Saha Roy, Jesse Breedlove, Ghislain St-Yves, Kendrick Kay, and Thomas Naselaris. Mental imagery: Weak vision or compressed vision? In Conference on Cognitive Computational Neuroscience, 2023. doi: 10.32470/CCN.2023.1693-0. URL https://2023.ccneuro.org/view_paper4eea.html?PaperNum=1693

[13] Jesse L. Breedlove, Ghislain St-Yves, Cheryl A. Olman, and Thomas Naselaris. Generative feedback explains distinct brain activity codes for seen and mental images. Current Biology, 30(12):2211–2224.e6, 2020. ISSN 0960-9822. doi: https://doi.org/10.1016/j.cub.2020.04.014. URL https://www.sciencedirect.com/science/article/pii/S0960982220304942

[14] Serra E Favila, Brice A Kuhl, and Jonathan Winawer. Spatial perception and memory have distinct activation profiles in human visual cortex. BioRxiv, page 811331, 2019

[15] Radoslaw M Cichy, Jakob Heinzle, and John-Dylan Haynes. Imagery and perception share cortical representations of content and location. Cerebral cortex, 22(2):372–380, 2012

[16] Anke Marit Albers, Peter Kok, Ivan Toni, H Chris Dijkerman, and Floris P De Lange. Shared representations for working memory and mental imagery in early visual cortex. Current Biology, 23(15):1427–1431, 2013

[17] Ghislain St-Yves, Jesse Breedlove, Kendrick Kay, and Thomas Naselaris. Do better models of fmri visual response better predict mental imagery responses? In Conference on Cognitive Computational Neuroscience, 2023. doi: 10.32470/CCN.2023.1644-0. URL https://2023.ccneuro.org/view_paper37c6.html?PaperNum=1644

[18] Bertrand Thirion, Edouard Duchesnay, Edward Hubbard, Jessica Dubois, Jean-Baptiste Poline, Denis Lebihan, and Stanislas Dehaene. Inverse retinotopy: Inferring the visual content of images from brain activation patterns. NeuroImage, 33(4):1104–1116, December 2006. ISSN 10538119. doi: 10.1016/j.neuroimage.2006.06.062. URL https://linkinghub.elsevier.com/ret...

[19] Mario Senden, Thomas C. Emmerling, Rick van Hoof, Martin A. Frost, and Rainer Goebel. Reconstructing imagined letters from early visual cortex reveals tight topographic correspondence between visual mental imagery and perception. Brain Structure and Function, 224(3):1167–1183, Jan 2019. doi: 10.1007/s00429-019-01828-6

[20] Hongmi Lee and Brice A. Kuhl. Reconstructing perceived and retrieved faces from activity patterns in lateral parietal cortex. Journal of Neuroscience, 36(22):6069–6082, 2016. Publisher: Soc Neuroscience

[21] Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image reconstruction from human brain activity. PLOS Computational Biology, 15(1):e1006633, January 2019. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1006633. URL https://dx.plos.org/10.1371/journal.pcbi.1006633

[23] Naoko Koide-Majima, Shinji Nishimoto, and Kei Majima. Mental image reconstruction from human brain activity: Neural decoding of mental imagery via deep neural network-based bayesian estimation. Neural Networks, 170:349–363, 2024. ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2023.11.024. URL https://www.sciencedirect.com/science/article/pii/S0893...

[24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machin...

[25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CoRR, abs/2112.10752, 2021. URL https://arxiv.org/abs/2112.10752

[26] Emily J. Allen, Ghislain St-Yves, Yihan Wu, Jesse L. Breedlove, Jacob S. Prince, Logan T. Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, J. Benjamin Hutchinson, Thomas Naselaris, and Kendrick Kay. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1):116–126, January 2022. ... doi: 10.1038/s41593-021-00962-x

[27] Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023

[27] [28]

Improving visual image reconstruction from human brain activity using latent diffusion models via multiple decoded inputs, 2023

Yu Takagi and Shinji Nishimoto. Improving visual image reconstruction from human brain activity using latent diffusion models via multiple decoded inputs, 2023

work page 2023

[28] [29]

Natural scene reconstruction from fmri signals using gen- erative latent diffusion.Scientific Reports, 13, 2023

Furkan Ozcelik and Rufin VanRullen. Natural scene reconstruction from fmri signals using gen- erative latent diffusion.Scientific Reports, 13, 2023. URL https://api.semanticscholar. org/CorpusID:260439960

work page 2023

[29] [30]

Reconstructing the mind’s eye: fMRI-to-image with contrastive learning and diffusion priors

Paul Steven Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Cohen Ethan, Aidan James Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth Norman, and Tanishq Mathew Abraham. Reconstructing the mind’s eye: fMRI-to-image with contrastive learning and diffusion priors. InThirty-seventh Conference on Neural Information Pro...

work page 2023

[30] [31]

Reconstructing seen images from human brain activity via guided stochastic search

Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Reconstructing seen images from human brain activity via guided stochastic search. InConference on Cognitive Computational Neuroscience, 2023. doi: 10.32470/CCN.2023.1672-0. URL https://2023. ccneuro.org/view_paper1337.html?PaperNum=1672

work page doi:10.32470/ccn.2023.1672-0 2023

[31] [32]

Second Sight: Using brain-optimized encoding models to align image distributions with human brain activity, June

Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Second Sight: Using brain-optimized encoding models to align image distributions with human brain activity, June

work page

[32] [33]

arXiv:2306.00927 [cs, q-bio]

URLhttp://arxiv.org/abs/2306.00927. arXiv:2306.00927 [cs, q-bio]. 14

work page arXiv

[33] [34]

Brain-optimized inference improves reconstructions of fMRI brain activity, December 2023

Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Brain-optimized inference improves reconstructions of fMRI brain activity, December 2023. URL http: //arxiv.org/abs/2312.07705. arXiv:2312.07705 [cs, q-bio]

work page arXiv 2023

[34] [35]

Through their eyes: multi-subject brain decoding with simple alignment techniques.Imaging Neuroscience, 2, 04 2024

Matteo Ferrante, Tommaso Boccato, Furkan Ozcelik, Rufin VanRullen, and Nicola Toschi. Through their eyes: multi-subject brain decoding with simple alignment techniques.Imaging Neuroscience, 2, 04 2024. doi: 10.1162/imag_a_00170

work page doi:10.1162/imag_a_00170 2024

[35] [36]

Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding

Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22710–22720, 2022. URLhttps://api.semanticscholar.org/CorpusID:253510456

work page 2023

[36] [37]

Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities, December 2023

Jingyuan Sun, Mingxiao Li, Zijiao Chen, Yunhao Zhang, Shaonan Wang, and Marie-Francine Moens. Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities, December 2023. URLhttp://arxiv.org/abs/2305.17214. arXiv:2305.17214 [cs]

work page arXiv 2023

[37] [38]

UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity, August 2023

Weijian Mai and Zhijun Zhang. UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity, August 2023. URL http://arxiv.org/ abs/2308.07428. arXiv:2308.07428 [cs]

work page arXiv 2023

[38] [39]

Very deep {vae}s generalize autoregressive models and can outperform them on images

Rewon Child. Very deep {vae}s generalize autoregressive models and can outperform them on images. InInternational Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=RLRXCV6DbEJ

work page 2021

[39] [40]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=di52zR8xgf

work page 2024

[40] [41]

Versatile diffusion: Text, images and variations all in one diffusion model.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7720–7731, 2022

Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7720–7731, 2022. URLhttps://api.semanticscholar. org/CorpusID:253523371

work page 2023

[41] [42]

Wang, Kendrick Kay, Thomas Naselaris, Michael J

Aria Y . Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, and Leila Wehbe. Incorporating natural language into vision models improves prediction and understanding of higher visual cortex, September 2022. URL https://www.biorxiv.org/content/10.1101/2022.09. 27.508760v1. Pages: 2022.09.27.508760 Section: New Results

Shizun Wang, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. MindBridge: A cross-subject brain decoding framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11333–11342, 2024

Jingyang Huo, Yikai Wang, Xuelin Qian, Yun Wang, Chong Li, Jianfeng Feng, and Yanwei Fu. NeuroPictor: Refining fMRI-to-image reconstruction via multi-individual pretraining and multi-level modulation, 2024

Dian Xie, Peiang Zhao, Jiarui Zhang, Kangqi Wei, Xiaobao Ni, and Jiong Xia. BrainRAM: Cross-modality retrieval-augmented image reconstruction from human brain activity. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, pages 3994–4003, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400706868. doi: 10.1145/3664647.3681296

Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, April 2004. ISSN 1941-0042. doi: 10.1109/TIP.2003.819861

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9...

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. ...

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. CoRR, abs/2006.09882, 2020. URL https://arxiv.org/abs/2006.09882

Pawan Sinha and Richard Russell. A perceptually based comparison of image similarity metrics. Perception, 40(11):1269–1281, 2011. doi: 10.1068/p7063. URL https://doi.org/10.1068/p7063. PMID: 22416586

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=G5RwHpBUv0

Ken Shirakawa, Yoshihiro Nagano, Misato Tanaka, Shuntaro C. Aoki, Kei Majima, Yusuke Muraki, and Yukiyasu Kamitani. Spurious reconstruction from brain activity: The thin line between reconstruction, classification, and hallucination. Journal of Vision, 2024. URL https://api.semanticscholar.org/CorpusID:269791182

David Mayo, Christopher Wang, Asa Harbin, Abdulrahman Alabdulkareem, Albert Eaton Shaw, Boris Katz, and Andrei Barbu. BrainBits: How much of the brain are generative reconstruction methods using? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=KAAUvi4kpb

Emily A. Holmes and Andrew Mathews. Mental imagery in emotion and emotional disorders. Clinical Psychology Review, 30(3):349–362, 2010

Joseph T. Giacino and Kathleen Kalmar. The vegetative and minimally conscious states: A comparison of clinical features and functional outcome. Journal of Head Trauma Rehabilitation, 12(4):36–51, 1997. doi: 10.1097/00001199-199708000-00005

Brian L. Edlow, Camille Chatelle, Camille A. Spencer, Catherine J. Chu, Yelena G. Bodien, Kathryn L. O’Connor, Ronald E. Hirschberg, Leigh R. Hochberg, Joseph T. Giacino, Eric S. Rosenthal, et al. Early detection of consciousness in patients with acute severe traumatic brain injury. Brain, 140(9):2399–2414, 2017. doi: 10.1093/brain/awx176

Alexis F. Turgeon, François Lauzier, Jean-François Simard, Damon C. Scales, Karen E.A. Burns, Lynne Moore, David A. Zygun, Francis Bernard, Maureen O. Meade, Tran Cong Dung, et al. Mortality associated with withdrawal of life-sustaining therapy for patients with severe traumatic brain injury: A Canadian multicentre cohort study. Canadian Medical Associ...

Livia Livinț Popa, Diana Chira, Ștefan Strilciuc, and Dafin F. Mureșanu. Non-invasive systems application in traumatic brain injury rehabilitation. Brain Sciences, 13(11), 2023. ISSN 2076-3425. doi: 10.3390/brainsci13111594. URL https://www.mdpi.com/2076-3425/13/11/1594

[60] [61]

Shiyu Luo, Qinwan Rabbani, and Nathan E. Crone. Brain-computer interface: Applications to speech decoding and synthesis to augment communication. Neurotherapeutics, 19(1):263–273, Jan 2022. doi: 10.1007/s13311-022-01190-2

Evan Canny, Mariska J. Vansteensel, Sandra M. van der Salm, Gernot R. Müller-Putz, and Julia Berezutskaya. Boosting brain–computer interfaces with functional electrical stimulation: Potential applications in people with locked-in syndrome. Journal of NeuroEngineering and Rehabilitation, 20(1), Nov 2023. doi: 10.1186/s12984-023-01272-y

Emma C. Gordon and Anil K. Seth. Ethical considerations for the use of brain–computer interfaces for cognitive enhancement. PLOS Biology, 22(10):1–15, Oct 2024. doi: 10.1371/journal.pbio.3002899. URL https://doi.org/10.1371/journal.pbio.3002899

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, June 2024

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023

Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, Jan 1992. doi: 10.1162/neco.1992.4.1.1

Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970. ISSN 00401706. URL http://www.jstor.org/stable/1267351

Aria Y. Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, and Leila Wehbe. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nature Machine Intelligence, 5(12):1415–1426, December 2023. ISSN 2522-5839. doi: 10.1038/s42256-023-00753-y

Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gU58d5QeGv

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=b4tMhpN0JC

Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021. URL https://arxiv.org/abs/2108.01073

Matteo Ferrante, Furkan Ozcelik, Tommaso Boccato, Rufin VanRullen, and Nicola Toschi. Brain Captioning: Decoding human brain activity into images and text, May 2023. URL http://arxiv.org/abs/2305.11560. arXiv:2305.11560 [cs]

Ghislain St-Yves, Emily J. Allen, Yihan Wu, Kendrick Kay, and Thomas Naselaris. Brain-optimized deep neural network models of human visual areas learn non-hierarchical representations. Nature Communications, 14(1):3329, 2023. ISSN 2041-1723. doi: 10.1038/s41467-023-38674-4. URL https://doi.org/10.1038/s41467-023-38674-4

Supporting information

S1 Tex...

Normalization of images: We disabled normalization of images when computing VGG19 features. During our initial trials, normalization led to unexpected color distortions in the reconstructed images. Removing normalization allowed the reconstructions to maintain their original color integrity, which is particularly crucial for visual comparisons in tasks req...
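
The paragraph above does not state how normalization produced the color casts. A minimal sketch of one plausible mechanism, assuming the standard ImageNet channel statistics that pretrained VGG19 weights are commonly paired with: per-channel normalization shifts and scales each RGB channel by a different amount, so any step that later treats normalized values as raw pixel intensities (for example, an image-optimization loop that never inverts the normalization) turns a neutral gray into a tinted color. The helpers `preprocess_for_vgg` and `to_image` are hypothetical and not taken from the authors' code.

```python
import numpy as np

# Standard ImageNet channel statistics (assumption: these are the values a
# VGG19 pipeline would use if normalization were enabled).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess_for_vgg(image: np.ndarray, normalize: bool = False) -> np.ndarray:
    """Convert an HxWx3 uint8 image into a 3xHxW float array for a VGG-style network.

    With normalize=False (the setting described in the text), pixels are only
    rescaled to [0, 1]. With normalize=True, each channel is shifted and scaled
    by its own ImageNet mean and standard deviation.
    """
    x = image.astype(np.float32) / 255.0
    if normalize:
        x = (x - IMAGENET_MEAN) / IMAGENET_STD
    return np.transpose(x, (2, 0, 1))  # HWC -> CHW


def to_image(x: np.ndarray) -> np.ndarray:
    """Map a 3xHxW array back to HxWx3 uint8, deliberately NOT undoing any normalization."""
    x = np.clip(np.transpose(x, (1, 2, 0)), 0.0, 1.0)
    return np.round(x * 255.0).astype(np.uint8)


# A uniform mid-gray test image.
gray = np.full((4, 4, 3), 128, dtype=np.uint8)

raw = to_image(preprocess_for_vgg(gray, normalize=False))
normed = to_image(preprocess_for_vgg(gray, normalize=True))

print(raw[0, 0])     # all three channels stay equal: still gray
print(normed[0, 0])  # channels drift apart: a color cast
```

Because the red, green, and blue means and standard deviations differ, the same gray value lands at three different normalized values; dropping normalization removes that channel-dependent transform entirely.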

Feature decoding with Ridge Regression: Instead of the fastl2lir library, we employed the Ridge Regression implementation from the sklearn library. This change enhanced compatibility with the rest of our workflow and provided better support for managing memory-intensive computations. For VGG19 layers with a large feature space, feature decoding was perform...
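
The description of how large VGG19 layers were handled is cut off, so the exact scheme cannot be recovered from the text. A minimal sketch of one common approach, assuming the decoder maps voxel responses `X` to layer features `Y` with scikit-learn's `Ridge`: split the feature columns into blocks and fit one model per block. Because ridge regression with a single shared penalty solves each target column independently, the block-wise fit matches one fit over all columns; only the size of each individual problem changes. The function names, `chunk_size`, `alpha`, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge


def fit_chunked_ridge(X, Y, alpha=1.0, chunk_size=1000):
    """Fit one Ridge model per block of target (feature) columns."""
    models = []
    for start in range(0, Y.shape[1], chunk_size):
        stop = min(start + chunk_size, Y.shape[1])
        model = Ridge(alpha=alpha)
        model.fit(X, Y[:, start:stop])  # slicing keeps the target 2-D, even for a 1-column block
        models.append((start, stop, model))
    return models


def predict_chunked_ridge(models, X, n_features):
    """Assemble block-wise predictions into one (n_samples, n_features) array."""
    Y_hat = np.empty((X.shape[0], n_features))
    for start, stop, model in models:
        Y_hat[:, start:stop] = model.predict(X)
    return Y_hat


# Synthetic stand-ins: 200 trials x 50 voxels mapped to 2,500 "layer features".
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 50))
W_true = rng.standard_normal((50, 2500))
Y_train = X_train @ W_true + 0.1 * rng.standard_normal((200, 2500))

models = fit_chunked_ridge(X_train, Y_train, alpha=1.0, chunk_size=1000)
Y_hat = predict_chunked_ridge(models, X_train, Y_train.shape[1])

# Reference: one Ridge fit over all feature columns at once.
Y_full = Ridge(alpha=1.0).fit(X_train, Y_train).predict(X_train)
print(len(models), np.abs(Y_hat - Y_full).max())
```

This sketch keeps every block in memory for clarity. In a memory-constrained setting, each block could instead be loaded from disk, fit, used to predict held-out trials, and discarded before the next block is processed.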
