Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI

Carl Yang; Dawei Zhou; Elynn Chen; Enpei Zhang; Rex Ying; Weikang Qiu; Xiang Zhang; Yinghao Cai; Yujun Yan; Zheng Huang

arxiv: 2510.16196 · v2 · pith:QQY3VSTYnew · submitted 2025-10-17 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-18 05:47 UTC · model grok-4.3

keywords: fMRI decoding, visual reconstruction, brain-to-image, text space, object-centric generation, diffusion models, compositional representation, attribute relationship

The pith

fMRI signals align more closely with language model text spaces than vision or joint spaces, enabling better image reconstruction when text is structured around objects, attributes, and relationships.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines which latent space best bridges fMRI brain recordings to image generation for reconstructing what a person sees. It reports that fMRI activity matches text embeddings from language models more closely than it matches vision-only embeddings or combined text-image spaces. To exploit this match, the method restructures the text space so it explicitly represents scenes as collections of objects with specific attributes and relations between them. This structured approach is realized in a new model that first maps fMRI to text, then uses an object-centric diffusion process and an automatic search for the best attributes and relationships. Real-world tests show the resulting reconstructions reduce perceptual loss by as much as 8 percent over prior techniques.

Core claim

fMRI signals are more similar to the text space of a language model than to either a vision-based space or a joint text-image space. Text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. The proposed PRISM model projects fMRI signals into this structured text space as an intermediate representation, employs an object-centric diffusion module that composes individual objects to reduce detection errors, and uses an attribute-relationship search module that automatically identifies the key attributes and relationships aligning with neural activity, yielding up to an 8% reduction in perceptual loss.

What carries the argument

PRISM, a model that projects fMRI signals into a structured text space as an intermediate representation for visual stimuli reconstruction, using an object-centric diffusion module to compose objects and an attribute-relationship search module to align with neural activity.
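
To make the idea of a structured text space concrete, the sketch below shows one possible way to hold a scene as objects, attributes, and relationships, and to flatten it into per-object phrases plus relationship phrases. The names `SceneObject`, `Relation`, `Scene`, and `to_prompts` are illustrative assumptions that do not come from the paper, and the paper's actual representation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str
    attributes: list[str] = field(default_factory=list)

    def phrase(self) -> str:
        # Attributes precede the noun, e.g. "small brown dog".
        return " ".join([*self.attributes, self.name])


@dataclass
class Relation:
    subject: int    # index into Scene.objects
    predicate: str  # e.g. "lying on"
    target: int     # index into Scene.objects


@dataclass
class Scene:
    objects: list[SceneObject]
    relations: list[Relation] = field(default_factory=list)

    def to_prompts(self) -> dict[str, list[str]]:
        # One phrase per object, plus one phrase per relationship.
        object_prompts = [obj.phrase() for obj in self.objects]
        relation_prompts = [
            f"{self.objects[r.subject].phrase()} {r.predicate} "
            f"{self.objects[r.target].phrase()}"
            for r in self.relations
        ]
        return {"objects": object_prompts, "relations": relation_prompts}


# Hypothetical example scene, invented for illustration only.
scene = Scene(
    objects=[SceneObject("dog", ["small", "brown"]), SceneObject("sofa", ["red"])],
    relations=[Relation(0, "lying on", 1)],
)
prompts = scene.to_prompts()
print(prompts)
```

Keeping objects and relationships as separate fields is what would let an object-centric generator handle each object on its own before composing the scene, in contrast to a single free-form caption.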

If this is right

Image reconstructions from fMRI improve when the intermediate representation explicitly encodes objects, attributes, and relationships rather than using raw vision or joint embeddings.
Object-centric composition in the diffusion stage reduces errors in detecting and placing individual elements within the generated scene.
Automatic search for the attributes and relationships that best match neural activity increases alignment between brain signals and the generative output.
Up to 8 percent lower perceptual loss is achieved on real-world fMRI datasets compared with earlier reconstruction pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the text-space preference holds across more datasets, training future generative models on richly compositional captions could produce outputs that more naturally match human visual representations.
The approach implies that fine-grained perceptual details missing from text descriptions may need supplementary non-text channels or richer attribute vocabularies to avoid information loss during reconstruction.
Extending the attribute-relationship search to handle dynamic or occluded scenes could test whether the current compositional structure scales to more complex visual stimuli.

Load-bearing premise

Mapping fMRI signals into a structured text space and then applying object-centric composition preserves the necessary visual information without dataset-specific overfitting or loss of fine-grained perceptual details that text attributes do not capture.

What would settle it

A side-by-side test on the same fMRI dataset that measures perceptual loss and object-detection accuracy when the generative model receives either the full structured text output or an equivalent unstructured text embedding of the same fMRI input.

Original abstract

Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims fMRI aligns better with text embeddings than vision ones and builds PRISM around structured text plus object-centric composition, but the similarity result needs tighter controls and the gains need clearer validation.

The authors report that fMRI signals evoked by visual stimuli match language-model text spaces more closely than vision or joint spaces. They introduce PRISM, which projects fMRI into structured text as an intermediate step, then uses an object-centric diffusion module and an attribute-relationship search module to reconstruct images while handling the compositionality of objects, attributes, and relations. The abstract states that this yields up to an 8% drop in perceptual loss on real-world data.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that fMRI signals align more closely with the text embedding space of language models than with vision-only or joint text-image spaces, and that both text representations and the generative model must be adapted to capture the compositional structure of visual stimuli (objects, attributes, and relationships). Building on these findings, the authors introduce PRISM, which projects fMRI signals into a structured text space and employs an object-centric diffusion module together with an attribute-relationship search module to reconstruct images, reporting up to an 8% reduction in perceptual loss on real-world datasets.

Significance. If the two core findings are substantiated with appropriate controls, the work would supply a concrete, testable hypothesis about the representational format of visual information in the brain and a practical architecture that exploits that format for improved stimulus reconstruction.

major comments (2)

[§3.2] Similarity Analysis: the reported preference for language-model text space over vision and joint spaces is not accompanied by controls for embedding dimensionality, choice of base models, or the precise linear/non-linear mapping used to project fMRI vectors into each space; without these controls the observed alignment may be an artifact of capacity or alignment procedure rather than an intrinsic property of neural encoding.
[§5] Experimental Results: the central performance claim of an 8% perceptual-loss reduction is presented without tabulated baselines, statistical significance tests, error bars, or explicit data-exclusion criteria, rendering it impossible to evaluate whether the improvement is robust or dataset-specific.
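
The first major comment above worries that the similarity result may depend on embedding dimensionality or on the fitted mapping. One mapping-free check is representational similarity analysis (RSA), which compares the pairwise dissimilarity structure of stimuli in two spaces, so no projection is fitted and the dimensionalities need not match. The sketch below is a standard technique swapped in for illustration, not the analysis used in the paper; the function names and the toy vectors are assumptions invented for this example.

```python
import math


def pearson(x, y):
    # Plain Pearson correlation between two equal-length sequences.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den


def dissimilarities(vectors):
    # Upper triangle of the pairwise (1 - Pearson correlation) matrix.
    n = len(vectors)
    return [1.0 - pearson(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]


def rsa_score(brain, embedding):
    # Correlate the two dissimilarity structures; rows must be the same
    # stimuli in the same order, but the column counts may differ.
    return pearson(dissimilarities(brain), dissimilarities(embedding))


# Toy data (made up): 4 stimuli, 3 "voxels" per fMRI pattern.
brain = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
# Candidate space A groups the stimuli the same way, in 4 dimensions.
space_a = [[2.0, 0.1, 0.0, 0.3], [1.8, 0.2, 0.1, 0.4],
           [0.1, 2.0, 0.3, 0.0], [0.0, 1.9, 0.2, 0.1]]
# Candidate space B groups the stimuli differently, in 3 dimensions.
space_b = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1], [0.9, 0.1, 0.3], [0.1, 0.9, 0.0]]

score_a = rsa_score(brain, space_a)
score_b = rsa_score(brain, space_b)
print(f"space A: {score_a:.3f}, space B: {score_b:.3f}")
```

Because the score depends only on relative distances between stimuli, a higher value for one candidate space cannot be attributed to that space having more dimensions or a more flexible fitted mapping.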

minor comments (2)

[§4.3] The description of the attribute-relationship search module should specify the exact search objective and how it avoids overfitting to the same fMRI-image pairs used for evaluation.
[Figure 4] The reconstruction examples would benefit from side-by-side comparison with the strongest published baseline on the same stimuli.
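
The first minor comment above asks how the search avoids overfitting to the fMRI-image pairs used for evaluation. One common safeguard, sketched below under assumed data structures, is to split trials by stimulus so that every repeat of a given image lands on the same side of the split. The function `split_by_stimulus`, the `stimulus_id` key, and the toy trial list are hypothetical; the paper does not describe its split at this level of detail.

```python
import random


def split_by_stimulus(trials, eval_fraction=0.25, seed=0):
    """Split trials so that no stimulus appears in both partitions."""
    stimulus_ids = sorted({t["stimulus_id"] for t in trials})
    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(stimulus_ids)
    n_eval = max(1, round(len(stimulus_ids) * eval_fraction))
    held_out = set(stimulus_ids[:n_eval])
    search = [t for t in trials if t["stimulus_id"] not in held_out]
    evaluation = [t for t in trials if t["stimulus_id"] in held_out]
    return search, evaluation


# Toy trials (made up): 8 stimuli, each shown in 3 repeated fMRI trials.
trials = [{"stimulus_id": s, "repeat": r} for s in range(8) for r in range(3)]
search_trials, eval_trials = split_by_stimulus(trials)

search_ids = {t["stimulus_id"] for t in search_trials}
eval_ids = {t["stimulus_id"] for t in eval_trials}
print(len(search_trials), len(eval_trials), search_ids & eval_ids)
```

Any attribute or relationship selection would then run only on `search_trials`, with `eval_trials` touched once for the final report.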

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will implement to strengthen the work.

Referee: [§3.2] Similarity Analysis: the reported preference for language-model text space over vision and joint spaces is not accompanied by controls for embedding dimensionality, choice of base models, or the precise linear/non-linear mapping used to project fMRI vectors into each space; without these controls the observed alignment may be an artifact of capacity or alignment procedure rather than an intrinsic property of neural encoding.

Authors: We acknowledge that additional controls are necessary to substantiate the claim that fMRI signals align more closely with text embeddings. In the revised version, we will conduct and report experiments that match embedding dimensionalities across spaces, evaluate multiple base models (e.g., different language models and vision transformers), and explicitly describe the projection method used. We will also test non-linear mappings to rule out procedural artifacts. These additions will clarify whether the preference is intrinsic to neural encoding.
revision: yes

Referee: [§5] Experimental Results: the central performance claim of an 8% perceptual-loss reduction is presented without tabulated baselines, statistical significance tests, error bars, or explicit data-exclusion criteria, rendering it impossible to evaluate whether the improvement is robust or dataset-specific.

Authors: We agree that the experimental results section would benefit from more rigorous reporting. We will revise §5 to include a comprehensive table comparing PRISM against all baselines with mean perceptual loss values, add statistical significance tests (e.g., Wilcoxon signed-rank tests with p-values), include error bars representing standard deviation across subjects or runs in all relevant figures, and explicitly state the data exclusion criteria (such as removing trials with head motion exceeding a threshold). This will allow readers to better assess the robustness of the reported improvements.
revision: yes
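
The reporting promised in the second response can be sketched with the standard library alone. The example below computes a per-subject relative reduction in perceptual loss, its mean and standard deviation, and an exact paired sign-flip permutation p-value. The permutation test is a stand-in chosen because it needs no third-party package; it is not the Wilcoxon signed-rank test the authors name. All loss values are made-up toy numbers, not results from the paper, and the function names are hypothetical.

```python
import itertools
import statistics


def relative_reduction(baseline, proposed):
    # Fraction by which the loss drops relative to the baseline.
    return (baseline - proposed) / baseline


def sign_flip_p_value(differences):
    """Exact two-sided paired permutation test on the mean difference."""
    observed = abs(statistics.fmean(differences))
    extreme = 0
    total = 0
    # Under the null, each paired difference is equally likely to be + or -.
    for signs in itertools.product((1, -1), repeat=len(differences)):
        flipped = statistics.fmean([s * d for s, d in zip(signs, differences)])
        total += 1
        if abs(flipped) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Toy per-subject perceptual losses (made up, 8 hypothetical subjects).
baseline_loss = [0.50, 0.48, 0.52, 0.47, 0.51, 0.49, 0.53, 0.50]
proposed_loss = [0.46, 0.45, 0.47, 0.44, 0.47, 0.46, 0.48, 0.46]

reductions = [relative_reduction(b, p) for b, p in zip(baseline_loss, proposed_loss)]
mean_reduction = statistics.fmean(reductions)
std_reduction = statistics.stdev(reductions)  # sample standard deviation
differences = [b - p for b, p in zip(baseline_loss, proposed_loss)]
p_value = sign_flip_p_value(differences)

print(f"relative reduction: {mean_reduction:.1%} +/- {std_reduction:.1%}")
print(f"sign-flip p-value: {p_value:.4f}")
```

Where SciPy is installed, `scipy.stats.wilcoxon(baseline_loss, proposed_loss)` would give the signed-rank test mentioned in the response. Reporting the per-subject spread alongside the mean also makes clear whether a figure such as "up to 8%" describes a typical subject or only the best case.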

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation rests on two empirical findings (fMRI-text similarity and need for compositional adaptation) that are presented as results from dataset experiments rather than definitions or self-citations. PRISM is then constructed on top of those findings, with performance claims tied to external benchmarks on real-world fMRI datasets. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the central claims to inputs by construction. The work is therefore treated as self-contained against external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the untested premise that text embeddings form a suitable bridge for fMRI data and that explicit object-attribute-relation decomposition improves reconstruction; no numerical free parameters are named in the abstract.

axioms (1)

domain assumption: fMRI signals can be effectively projected into and aligned with a language-model text latent space
This premise underpins the entire PRISM pipeline and the first key finding.

invented entities (2)

object-centric diffusion module (no independent evidence)
purpose: Generates images by composing individual objects to reduce object detection errors
New module introduced to handle compositional structure of visual stimuli

attribute-relationship search module (no independent evidence)
purpose: Automatically identifies key attributes and relationships that best align with neural activity
New module introduced to adapt text representations to fMRI data

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: object-centric diffusion module that generates images by composing individual objects

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.