pith. sign in

arxiv: 2606.00121 · v1 · pith:6BYB5PUInew · submitted 2026-05-28 · 💻 cs.CV · cs.AI

Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity

Pith reviewed 2026-06-29 08:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image reconstructionbrain activity decodingCLIPStable Diffusionsemantic guidancestructural guidancefMRIEEG
0
0 comments X

The pith

Two-stage framework reconstructs images from brain activity matching semantics and structure

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors present MindDiffuser, a versatile two-stage framework for reconstructing images from brain responses in fMRI, EEG, and MEG. In the first stage, CLIP text embeddings decoded from brain data drive Stable Diffusion to produce an image with correct semantics. In the second stage, shallow CLIP visual features decoded from the same brain responses serve as targets to iteratively update the feature vectors from stage one through backpropagation, enforcing structural consistency. This dual guidance overcomes the structural shortcomings of prior semantic-only approaches. Experiments across datasets confirm gains over existing methods and visualizations indicate biological relevance.

Core claim

The paper claims that inputting decoded CLIP text embeddings into Stable Diffusion for semantic generation, followed by backpropagation-based refinement using decoded shallow CLIP visual features, produces image reconstructions from brain activity that align with the original stimuli in both semantic concepts and fine-grained structural details such as position, orientation, and size.

What carries the argument

MindDiffuser's two-stage process: semantic generation from text embeddings in diffusion, followed by structural alignment via backprop on visual features.

If this is right

  • Reconstructions achieve better fine-grained structural consistency with original stimuli.
  • The approach works across fMRI, EEG, and MEG brain signal types.
  • Previous state-of-the-art models see significant performance improvements.
  • Spatial and temporal visualizations support neurobiological plausibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the refinement technique to other modalities or tasks could enhance controllability in brain-computer interfaces.
  • Testing whether the structural refinement preserves semantics across different generative backbones would validate the separation of concerns.
  • The method implies that brain signals contain separable semantic and structural information extractable via CLIP.

Load-bearing premise

Decoded shallow CLIP visual features accurately capture the structural information of the viewed image and can guide refinement without introducing distortions or artifacts.

What would settle it

If Stage 2 refinement results in images with reduced structural similarity to the ground truth compared to Stage 1 outputs, as quantified by metrics for position, size, or orientation match.

read the original abstract

Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. Especially, the achievement of precise and controllable image reconstruction bears great significance in propelling the progress and utilization of brain-computer interfaces. Recent methods, leveraging advances in the power of text-to-image generation models, have reconstructed images that closely approximate complex natural stimuli in terms of semantics (e.g., concepts and objects). However, they struggle to maintain consistency with the original stimuli in fine-grained structural information (e.g., position, orientation and size), which undermines both the controllability and interpretability of the models. To address the aforementioned issues, we propose a two-stage image reconstruction framework, termed MindDiffuser. In Stage 1, Contrastive Language-Image Pretraining (CLIP) text embeddings decoded from brain responses are input into Stable Diffusion, generating a preliminary image containing semantic information. In Stage 2, we use decoded shallow CLIP visual features as supervisory signals, iteratively refining the feature vectors from Stage 1 via backpropagation to align structural information. We conducted extensive experiments on brain response datasets across three modalities (fMRI, EEG, MEG) elicited by visual stimuli, demonstrating that our framework significantly enhances the performance of previous state-of-the-art models, highlighting the effectiveness and versatility of our approach. Spatial and temporal visualization results further support the neurobiological plausibility of our framework, providing guidance for future neural decoding efforts across different brain signal modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MindDiffuser, a two-stage framework for image reconstruction from brain recordings (fMRI, EEG, MEG). Stage 1 decodes CLIP text embeddings from brain responses and feeds them into Stable Diffusion to produce a semantically plausible image. Stage 2 treats decoded shallow CLIP visual features as supervisory signals and iteratively refines the Stage-1 latent vectors via backpropagation to enforce structural consistency (position, orientation, size). The authors claim that this yields significant improvements over prior state-of-the-art methods across three modalities and that spatial/temporal visualizations support neurobiological plausibility.

Significance. If the quantitative gains and the orthogonality of the structural signal can be demonstrated, the work would offer a practical way to combine semantic guidance from large text-to-image models with structural constraints derived from brain data. The reliance on off-the-shelf CLIP and Stable Diffusion plus standard backpropagation is a strength that lowers the barrier to reproducibility. The cross-modality evaluation is also potentially valuable for the field.

major comments (2)
  1. [Method (Stage 2)] Stage 2 description (method): the claim that back-propagating shallow CLIP visual features supplies accurate structural supervision rests on the untested assumption that these features (a) encode position/orientation/size information orthogonal to the text embedding and (b) remain sufficiently accurate when decoded from fMRI/EEG/MEG. No ablation, no pre-/post-refinement CLIP text-image similarity scores, and no quantitative check on semantic drift are reported, making the central performance claim load-bearing on an unverified premise.
  2. [Experiments] Experiments section: the abstract and results assert that the framework 'significantly enhances' prior SOTA performance, yet supply no numerical metrics, baseline tables, error bars, or explicit definition of how structural alignment is measured or validated. Without these data the improvement claim cannot be assessed.
minor comments (1)
  1. [Method] Notation for the two feature spaces (text embedding vs. shallow visual features) should be introduced with explicit symbols and kept consistent across equations and text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below with clarifications and commit to revisions that strengthen the evidence for our claims without altering the core contributions.

read point-by-point responses
  1. Referee: [Method (Stage 2)] Stage 2 description (method): the claim that back-propagating shallow CLIP visual features supplies accurate structural supervision rests on the untested assumption that these features (a) encode position/orientation/size information orthogonal to the text embedding and (b) remain sufficiently accurate when decoded from fMRI/EEG/MEG. No ablation, no pre-/post-refinement CLIP text-image similarity scores, and no quantitative check on semantic drift are reported, making the central performance claim load-bearing on an unverified premise.

    Authors: We agree that explicit validation of the orthogonality assumption and checks for semantic drift would strengthen the presentation. CLIP's architecture separates text and image encoders by design, with shallow visual features known to retain spatial information; however, to directly test this in our setting we will add an ablation study (with/without Stage 2) and report pre-/post-refinement CLIP text-image similarity scores plus semantic drift metrics in the revised manuscript. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract and results assert that the framework 'significantly enhances' prior SOTA performance, yet supply no numerical metrics, baseline tables, error bars, or explicit definition of how structural alignment is measured or validated. Without these data the improvement claim cannot be assessed.

    Authors: The results section of the full manuscript contains quantitative tables comparing MindDiffuser to prior SOTA methods on fMRI, EEG and MEG data using FID, SSIM, CLIP semantic scores and structural metrics (bounding-box IoU for position/size, angular error for orientation), with error bars from cross-validation. We will revise to make these tables and metric definitions more prominent and add any omitted baseline details. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pre-trained models

full rationale

The paper's two-stage framework (Stage 1: decode CLIP text embeddings into Stable Diffusion; Stage 2: backpropagate decoded shallow CLIP visual features) applies standard external components (pre-trained CLIP, Stable Diffusion) and conventional optimization without any internal parameter fits that are then relabeled as predictions, without self-definitional loops, and without load-bearing self-citations. No equation or step reduces the claimed output to an input quantity by construction. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework depends on the established capabilities of pre-trained CLIP and Stable Diffusion models plus the assumption that backpropagation can align structure without side effects; no new entities are postulated and no free parameters are explicitly fitted in the abstract description.

axioms (2)
  • domain assumption CLIP can decode both text and shallow visual embeddings from brain responses that are useful for image generation and supervision
    Invoked for both Stage 1 text embeddings and Stage 2 visual feature supervision.
  • domain assumption Stable Diffusion can produce images from CLIP text embeddings that preserve semantic content of the original stimulus
    Basis for the preliminary image in Stage 1.

pith-pipeline@v0.9.1-grok · 5804 in / 1485 out tokens · 49585 ms · 2026-06-29T08:47:03.027568+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 10 canonical work pages · 3 internal anchors

  1. [1]

    Frontiers in neuroscience15, 795488 (2021)

    Rakhimberdina, Z., Jodelet, Q., Liu, X., Murata, T.: Natural image reconstruction from fMRI using deep learning: A survey. Frontiers in neuroscience15, 795488 (2021)

  2. [2]

    Machine Intelligence Research19(5), 439–455 (2022) Springer Nature 2025 LATEX template MindDiffuser25

    Zhou, Q., Du, C., He, H.: Exploring the brain-like properties of deep neural networks: a neural encoding perspective. Machine Intelligence Research19(5), 439–455 (2022) Springer Nature 2025 LATEX template MindDiffuser25

  3. [3]

    IEEE Transactions on Medical Imaging42(8), 2262– 2273 (2023)

    Huang, Z., Du, C., Wang, Y., Fu, K., He, H.: Graph-enhanced emotion neural decoding. IEEE Transactions on Medical Imaging42(8), 2262– 2273 (2023). https://doi.org/10.1109/TMI.2023.3246220

  4. [4]

    Machine Intelligence Research, 1–18 (2025)

    Zhou, Q., Du, C., Li, D., Wen, B., Chang, L., He, H.: Interpretable visual neural decoding with unsupervised semantic disentanglement. Machine Intelligence Research, 1–18 (2025)

  5. [5]

    IEEE Transactions on Neural Networks and Learning Systems33(2), 600– 614 (2020)

    Du, C., Du, C., Huang, L., Wang, H., He, H.: Structured neural decoding with multitask transfer learning of deep neural network representations. IEEE Transactions on Neural Networks and Learning Systems33(2), 600– 614 (2020)

  6. [6]

    IEEE transactions on neural networks and learning systems30(8), 2310–2323 (2018)

    Du, C., Du, C., Huang, L., He, H.: Reconstructing perceived images from human brain activities with Bayesian deep multiview learning. IEEE transactions on neural networks and learning systems30(8), 2310–2323 (2018)

  7. [7]

    Frontiers in Computational Neuroscience17, 1253234 (2024) https://doi.org/10.3389/fncom

    Shen, G., Dwivedi, K., Majima, K., Horikawa, T., Kamitani, Y.: End-to- end deep image reconstruction from human brain activity. Frontiers in Computational Neuroscience13(2019). https://doi.org/10.3389/fncom. 2019.00021

  8. [8]

    Advances in Neural Information Processing Systems32 (2019)

    Beliy, R., Gaziv, G., Hoogi, A., Strappini, F., Golan, T., Irani, M.: From voxels to pixels and back: Self-supervision in natural-image reconstruc- tion from fMRI. Advances in Neural Information Processing Systems32 (2019)

  9. [9]

    Advances in neural information processing systems32(2019)

    Donahue, J., Simonyan, K.: Large scale adversarial representation learn- ing. Advances in neural information processing systems32(2019)

  10. [10]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High- resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)

  11. [11]

    Advances in Neural Information Processing Systems35, 29624–29636 (2022)

    Lin, S., Sprague, T., Singh, A.K.: Mind Reader: Reconstructing Complex Images from Brain Activities. Advances in Neural Information Processing Systems35, 29624–29636 (2022)

  12. [12]

    Scientific Reports13(1), 15666 (2023)

    Ozcelik, F., VanRullen, R.: Natural scene reconstruction from fmri signals using generative latent diffusion. Scientific Reports13(1), 15666 (2023)

  13. [13]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Takagi, Y., Nishimoto, S.: High-resolution image reconstruction with latent diffusion models from human brain activity. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14453–14463 (2023) Springer Nature 2025 LATEX template 26MindDiffuser

  14. [14]

    arXiv preprint arXiv:2212.02409 (2022)

    Gu, Z., Jamison, K., Kuceyeski, A., Sabuncu, M.: Decoding natural image stimuli from fmri data with a surface-based convolutional network. arXiv preprint arXiv:2212.02409 (2022)

  15. [15]

    Advances in Neural Information Processing Systems36 (2024)

    Scotti, P., Banerjee, A., Goode, J., Shabalin, S., Nguyen, A., Dempster, A., Verlinde, N., Yundler, E., Weisberg, D., Norman, K., et al.: Recon- structing the Mind’s Eye: fMRI-to-image with contrastive learning and diffusion priors. Advances in Neural Information Processing Systems36 (2024)

  16. [16]

    Journal of Neuroscience37(36), 8767–8782 (2017)

    Vaziri-Pashkam, M., Xu, Y.: Goal-directed visual processing differen- tially impacts human ventral and dorsal visual representations. Journal of Neuroscience37(36), 8767–8782 (2017)

  17. [17]

    Journal of Cognitive Neuroscience26(1), 189–209 (2014)

    Zachariou, V., Klatzky, R., Behrmann, M.: Ventral and dorsal visual stream contributions to the perception of object shape and object location. Journal of Cognitive Neuroscience26(1), 189–209 (2014)

  18. [18]

    In: Proceedings of the 31st ACM International Conference on Multimedia, pp

    Lu, Y., Du, C., Zhou, Q., Wang, D., He, H.: Minddiffuser: Controlled image reconstruction from human brain activity with semantic and struc- tural diffusion. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5899–5908 (2023)

  19. [19]

    Interna- tional Conference on Learning Representations (2014)

    Kingma, D.P., Welling, M.: Auto-Encoding Variational Bayes. Interna- tional Conference on Learning Representations (2014)

  20. [20]

    Communications of the ACM63(11), 139–144 (2020)

    Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Networks. Communications of the ACM63(11), 139–144 (2020)

  21. [21]

    Journal of membrane science107(1-2), 1–21 (1995)

    Wijmans, J.G., Baker, R.W.: The solution-diffusion model: a review. Journal of membrane science107(1-2), 1–21 (1995)

  22. [22]

    Advances in neural information processing systems33, 6840–6851 (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising Diffusion Probabilistic Models. Advances in neural information processing systems33, 6840–6851 (2020)

  23. [23]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchi- cal text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)

  24. [24]

    In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

    Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2i- adapter: Learning adapters to dig out more controllable ability for text- to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 4296–4304 (2024)

  25. [25]

    In: Proceedings of the Springer Nature 2025 LATEX template MindDiffuser27 IEEE/CVF International Conference on Computer Vision, pp

    Xu, X., Wang, Z., Zhang, G., Wang, K., Shi, H.: Versatile diffusion: Text, images and variations all in one diffusion model. In: Proceedings of the Springer Nature 2025 LATEX template MindDiffuser27 IEEE/CVF International Conference on Computer Vision, pp. 7754–7765 (2023)

  26. [26]

    Identifying natural images from human brain activity

    Kay, K.: Naselaris T, Prenger RJ, Gallant JL. Identifying natural images from human brain activity. nature452, 352–355 (2008)

  27. [27]

    Neuron 63(6), 902–915 (2009)

    Naselaris, T., Prenger, R.J., Kay, K.N., Oliver, M., Gallant, J.L.: Bayesian reconstruction of natural images from human brain activity. Neuron 63(6), 902–915 (2009)

  28. [28]

    Neural computation25(4), 979–1005 (2013)

    Fujiwara, Y., Miyawaki, Y., Kamitani, Y.: Modular encoding and decod- ing models derived from Bayesian canonical correlation analysis. Neural computation25(4), 979–1005 (2013)

  29. [29]

    NeuroImage254, 119121 (2022)

    Gaziv, G., Beliy, R., Granot, N., Hoogi, A., Strappini, F., Golan, T., Irani, M.: Self-supervised natural image reconstruction and large-scale semantic classification from brain activity. NeuroImage254, 119121 (2022)

  30. [30]

    et al., eds.; 2006)[book reviews]

    Chapelle, O., Scholkopf, B., Zien, A.: Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks 20(3), 542–542 (2009)

  31. [31]

    arXiv preprint arXiv:2211.06956 (2022)

    Chen, Z., Qing, J., Xiang, T., Yue, W.L., Zhou, J.H.: Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding. arXiv preprint arXiv:2211.06956 (2022)

  32. [32]

    In: 2022 International Joint Conference on Neural Networks (IJCNN), pp

    Ozcelik, F., Choksi, B., Mozafari, M., Reddy, L., VanRullen, R.: Recon- struction of perceived images from fmri patterns and semantic brain exploration using instance-conditioned gans. In: 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2022). IEEE

  33. [33]

    Advances in Neural Information Processing Systems34, 27517–27529 (2021)

    Casanova, A., Careil, M., Verbeek, J., Drozdzal, M., Romero Soriano, A.: Instance-Conditioned GAN. Advances in Neural Information Processing Systems34, 27517–27529 (2021)

  34. [34]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp

    Xia, W., de Charette, R., Oztireli, C., Xue, J.-H.: Dream: Visual decoding from reversing human visual system. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 8226–8235 (2024)

  35. [35]

    PLoS computational biology15(1), 1006633 (2019)

    Shen, G., Horikawa, T., Majima, K., Kamitani, Y.: Deep image recon- struction from human brain activity. PLoS computational biology15(1), 1006633 (2019)

  36. [36]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Simonyan, K.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) Springer Nature 2025 LATEX template 28MindDiffuser

  37. [37]

    ArXiv (2023)

    Kneeland, R., Ojeda, J., St-Yves, G., Naselaris, T.: Second sight: Using brain-optimized encoding models to align image distributions with human brain activity. ArXiv (2023)

  38. [38]

    In: Proceedings of the 32nd ACM International Conference on Multimedia, pp

    Xie, D., Zhao, P., Zhang, J., Wei, K., Ni, X., Xia, J.: Brainram: Cross- modality retrieval-augmented image reconstruction from human brain activity. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3994–4003 (2024)

  39. [39]

    Wang, Kendrick Kay, Thomas Naselaris, Michael J

    Wang, A.Y., Kay, K., Naselaris, T., Tarr, M.J., Wehbe, L.: Better models of human high-level visual cortex emerge from natural language super- vision with a large and diverse dataset. Nat Mach Intell 5, 1415–1426 (2023). https://doi.org/10.1038/s42256-023-00753-y

  40. [40]

    Advances in Neural Information Processing Systems (2024)

    Li, D., Wei, C., Li, S., Zou, J., Qin, H., Liu, Q.: Visual decoding and reconstruction via eeg embeddings with guided diffusion. Advances in Neural Information Processing Systems (2024)

  41. [41]

    International Conference on Learning Representations (2024)

    Song, Y., Liu, B., Li, X., Shi, N., Wang, Y., Gao, X.: Decoding natu- ral images from eeg for object recognition. International Conference on Learning Representations (2024)

  42. [42]

    In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Con- ference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Con- ference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241 (2015). Springer

  43. [43]

    Nature neuroscience25(1), 116–126 (2022)

    Allen, E.J., St-Yves, G., Wu, Y., Breedlove, J.L., Prince, J.S., Dowdle, L.T., Nau, M., Caron, B., Pestilli, F., Charest, I.,et al.: A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature neuroscience25(1), 116–126 (2022)

  44. [44]

    NeuroImage264, 119754 (2022)

    Gifford, A.T., Dwivedi, K., Roig, G., Cichy, R.M.: A large and rich eeg dataset for modeling human visual object recognition. NeuroImage264, 119754 (2022)

  45. [45]

    Elife12, 82580 (2023)

    Hebart, M.N., Contier, O., Teichmann, L., Rockter, A.H., Zheng, C.Y., Kidder, A., Corriveau, A., Vaziri-Pashkam, M., Baker, C.I.: Things-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. Elife12, 82580 (2023)

  46. [46]

    In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp

    Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll´ ar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755 (2014). Springer Springer Nature 2025 LATEX template MindDiffuser29

  47. [47]

    International Conference on Learning Representations (2024)

    Benchetrit, Y., Banville, H., King, J.-R.: Brain decoding: toward real-time reconstruction of visual perception. International Conference on Learning Representations (2024)

  48. [48]

    IEEE transac- tions on image processing13(4), 600–612 (2004)

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transac- tions on image processing13(4), 600–612 (2004)

  49. [49]

    Advances in neural information processing systems25(2012)

    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems25(2012)

  50. [50]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

  51. [51]

    In: International Conference on Machine Learning, pp

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J.,et al.: Learning transfer- able visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR

  52. [52]

    In: International Conference on Machine Learning, pp

    Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR

  53. [53]

    Advances in neural information processing systems33, 9912–9924 (2020)

    Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assign- ments. Advances in neural information processing systems33, 9912–9924 (2020)

  54. [54]

    Journal of neural engineering15(5), 056013 (2018)

    Lawhern, V.J., Solon, A.J., Waytowich, N.R., Gordon, S.M., Hung, C.P., Lance, B.J.: Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces. Journal of neural engineering15(5), 056013 (2018)

  55. [55]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

  56. [56]

    IEEE Transactions on Medical Imaging, 1–1 (2025)

    Ma, Y., Liu, Y., Chen, L., Zhu, G., Chen, B., Zheng, N.: Brainclip: Brain representation via clip for generic natural visual stimulus decoding. IEEE Transactions on Medical Imaging, 1–1 (2025). https://doi.org/10.1109/ TMI.2025.3537287

  57. [57]

    In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Quan, R., Wang, W., Tian, Z., Ma, F., Yang, Y.: Psychometry: An omnifit Springer Nature 2025 LATEX template 30MindDiffuser model for image reconstruction from human brain activity. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 233–243 (2024)

  58. [58]

    In: European Conference on Computer Vision, pp

    Xia, W., de Charette, R., Oztireli, C., Xue, J.-H.: Umbrae: Unified multi- modal brain decoding. In: European Conference on Computer Vision, pp. 242–259 (2024). Springer

  59. [59]

    Nature reviews neuroscience 13(6), 407–420 (2012)

    Buzs´ aki, G., Anastassiou, C.A., Koch, C.: The origin of extracellular fields and currents—eeg, ecog, lfp and spikes. Nature reviews neuroscience 13(6), 407–420 (2012)

  60. [60]

    Neuroimage 46(1), 168–176 (2009)

    Henson, R.N., Mattout, J., Phillips, C., Friston, K.J.: Selecting forward models for meg source-reconstruction using model-evidence. Neuroimage 46(1), 168–176 (2009)

  61. [61]

    Frontiers in Neuroinformatics9(2015)

    Gao, J.S., Huth, A.G., Lescroart, M.D., Gallant, J.L.: Pycortex: an inter- active surface visualizer for fMRI. Frontiers in Neuroinformatics9(2015). https://doi.org/10.3389/fninf.2015.00023

  62. [62]

    PLoS computational biology5(11), 1000579 (2009)

    Pinto, N., Doukhan, D., DiCarlo, J.J., Cox, D.D.: A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS computational biology5(11), 1000579 (2009)

  63. [63]

    Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)

  64. [64]

    Advances in neural information processing systems (2023) Yizhuo Lureceived the B.S

    Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dreamsim: Learning new dimensions of human visual similarity using synthetic data. Advances in neural information processing systems (2023) Yizhuo Lureceived the B.S. degree in statistics from Beijing Institute of Technology, Beijing, China, in 2023. He is a Ph.D. degree candidate...