Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI
Pith reviewed 2026-05-18 05:47 UTC · model grok-4.3
The pith
fMRI signals align more closely with language model text spaces than vision or joint spaces, enabling better image reconstruction when text is structured around objects, attributes, and relationships.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
fMRI signals are more similar to the text space of a language model than to either a vision-based space or a joint text-image space. Text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. The proposed PRISM model projects fMRI signals into this structured text space as an intermediate representation, employs an object-centric diffusion module that composes individual objects to reduce detection errors, and uses an attribute-relationship search module that automatically identifies the key attributes and relationships aligning with neural activity, yielding up to an 8% reduction in perceptual loss.
What carries the argument
PRISM, a model that projects fMRI signals into a structured text space as an intermediate representation for visual stimuli reconstruction, using an object-centric diffusion module to compose objects and an attribute-relationship search module to align with neural activity.
If this is right
- Image reconstructions from fMRI improve when the intermediate representation explicitly encodes objects, attributes, and relationships rather than using raw vision or joint embeddings.
- Object-centric composition in the diffusion stage reduces errors in detecting and placing individual elements within the generated scene.
- Automatic search for the attributes and relationships that best match neural activity increases alignment between brain signals and the generative output.
- Up to 8 percent lower perceptual loss is achieved on real-world fMRI datasets compared with earlier reconstruction pipelines.
Where Pith is reading between the lines
- If the text-space preference holds across more datasets, training future generative models on richly compositional captions could produce outputs that more naturally match human visual representations.
- The approach implies that fine-grained perceptual details missing from text descriptions may need supplementary non-text channels or richer attribute vocabularies to avoid information loss during reconstruction.
- Extending the attribute-relationship search to handle dynamic or occluded scenes could test whether the current compositional structure scales to more complex visual stimuli.
Load-bearing premise
Mapping fMRI signals into a structured text space and then applying object-centric composition preserves the necessary visual information without dataset-specific overfitting or loss of fine-grained perceptual details that text attributes do not capture.
What would settle it
A side-by-side test on the same fMRI dataset that measures perceptual loss and object-detection accuracy when the generative model receives either the full structured text output or an equivalent unstructured text embedding of the same fMRI input.
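One way to score such a side-by-side test is a paired bootstrap over per-stimulus losses. The sketch below is an illustration under stated assumptions, not the paper's evaluation code: the function name `paired_bootstrap_ci`, the loss values, and the resampling settings are hypothetical, and it presumes the perceptual loss has already been computed for both conditions on the same stimuli in the same order.

```python
import random
from statistics import mean


def paired_bootstrap_ci(structured, unstructured, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (unstructured - structured).

    A positive difference means the structured-text condition had lower loss.
    """
    if len(structured) != len(unstructured):
        raise ValueError("both conditions must score the same stimuli in the same order")
    diffs = [u - s for s, u in zip(structured, unstructured)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample stimuli with replacement and record the mean difference each time.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Hypothetical per-stimulus perceptual losses (lower is better); not data from the paper.
structured_loss = [0.41, 0.38, 0.45, 0.40, 0.36, 0.43, 0.39, 0.42]
unstructured_loss = [0.44, 0.40, 0.49, 0.41, 0.40, 0.47, 0.40, 0.45]

point, lower, upper = paired_bootstrap_ci(structured_loss, unstructured_loss)
print(f"mean difference {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would suggest the structured condition helps on these stimuli. The same function could be applied to object-detection accuracy by swapping in per-stimulus accuracy values and reversing the sign convention.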
Original abstract
Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that fMRI signals align more closely with the text embedding space of language models than with vision-only or joint text-image spaces, and that both text representations and the generative model must be adapted to capture the compositional structure of visual stimuli (objects, attributes, and relationships). Building on these findings, the authors introduce PRISM, which projects fMRI signals into a structured text space and employs an object-centric diffusion module together with an attribute-relationship search module to reconstruct images, reporting up to an 8% reduction in perceptual loss on real-world datasets.
Significance. If the two core findings are substantiated with appropriate controls, the work would supply a concrete, testable hypothesis about the representational format of visual information in the brain and a practical architecture that exploits that format for improved stimulus reconstruction.
major comments (2)
- [§3.2] §3.2 (Similarity Analysis): the reported preference for language-model text space over vision and joint spaces is not accompanied by controls for embedding dimensionality, choice of base models, or the precise linear/non-linear mapping used to project fMRI vectors into each space; without these controls the observed alignment may be an artifact of capacity or alignment procedure rather than an intrinsic property of neural encoding.
- [§5] §5 (Experimental Results): the central performance claim of an 8% perceptual-loss reduction is presented without tabulated baselines, statistical significance tests, error bars, or explicit data-exclusion criteria, rendering it impossible to evaluate whether the improvement is robust or dataset-specific.
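For the dimensionality and mapping concern in the first major comment, one check that avoids a learned projection altogether is representational similarity analysis (RSA). This is a substituted technique offered as a sketch, not the analysis reported in §3.2: it compares the pairwise dissimilarity structure of the stimuli in fMRI space with that in each candidate embedding space, so a higher-capacity projection cannot inflate the score. All names and the toy vectors below are assumptions.

```python
import math


def pairwise_distances(vectors):
    """Upper-triangle Euclidean distances between all pairs of stimulus vectors."""
    return [
        math.dist(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("distances are constant; correlation is undefined")
    return cov / math.sqrt(vx * vy)


def rsa_score(fmri_vectors, embedding_vectors):
    """Spearman correlation between the two spaces' pairwise-distance profiles."""
    if len(fmri_vectors) != len(embedding_vectors):
        raise ValueError("both spaces must describe the same stimuli")
    return pearson(
        average_ranks(pairwise_distances(fmri_vectors)),
        average_ranks(pairwise_distances(embedding_vectors)),
    )


# Toy vectors: four stimuli in a 2-D "fMRI" space and two candidate spaces of
# different dimensionality. Values are illustrative only.
fmri = [[0.0, 1.0], [2.0, 0.5], [1.0, 3.0], [4.0, 4.0]]
same_geometry_3d = [[2 * a, 2 * b, 0.0] for a, b in fmri]
other_geometry_4d = [[1.0, 0.0, 0.0, 2.0], [0.0, 5.0, 1.0, 0.0],
                     [0.5, 0.5, 0.5, 0.5], [3.0, 0.0, 4.0, 1.0]]

print(rsa_score(fmri, same_geometry_3d))
print(rsa_score(fmri, other_geometry_4d))
```

Because each space is reduced to a stimulus-by-stimulus distance profile, spaces of different widths are compared on equal footing. RSA does not replace the dimensionality-matched and multi-model controls the referee asks for; it is one complementary check.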
minor comments (2)
- [§4.3] The description of the attribute-relationship search module should specify the exact search objective and how it avoids overfitting to the same fMRI-image pairs used for evaluation.
- [Figure 4] Figure 4 (reconstruction examples) would benefit from side-by-side comparison with the strongest published baseline on the same stimuli.
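The overfitting concern raised about the search module can be illustrated with a held-out selection scheme: choose attributes and relationships on one set of fMRI-image trials and report the score on a disjoint set. The sketch below is a generic pattern, not the search objective of §4.3, which the report notes is unspecified; `score_fn`, the candidate labels, and the toy scores are hypothetical.

```python
import random


def split_trials(trial_ids, search_fraction=0.5, seed=0):
    """Shuffle trial ids and split them into disjoint search and evaluation sets."""
    ids = list(trial_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * search_fraction)
    if cut == 0 or cut == len(ids):
        raise ValueError("need at least one trial in each split")
    return ids[:cut], ids[cut:]


def mean_score(candidate, trial_ids, score_fn):
    return sum(score_fn(candidate, t) for t in trial_ids) / len(trial_ids)


def search_then_evaluate(candidates, trial_ids, score_fn, seed=0):
    """Select the best candidate on the search split, then score it on held-out trials."""
    search_ids, eval_ids = split_trials(trial_ids, seed=seed)
    best = max(candidates, key=lambda c: mean_score(c, search_ids, score_fn))
    return best, mean_score(best, eval_ids, score_fn)


# Toy stand-in for an fMRI alignment score; the real objective is not given in the source.
def toy_score(candidate, trial_id):
    base = {"color": 0.2, "color+position": 0.6, "texture": 0.4}[candidate]
    return base + 0.01 * (trial_id % 3)


best, held_out = search_then_evaluate(
    ["color", "color+position", "texture"], range(20), toy_score
)
print(best, round(held_out, 3))
```

The held-out score is the one to report, since the search split was used to make the choice.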
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will implement to strengthen the work.
- Referee: [§3.2] §3.2 (Similarity Analysis): the reported preference for language-model text space over vision and joint spaces is not accompanied by controls for embedding dimensionality, choice of base models, or the precise linear/non-linear mapping used to project fMRI vectors into each space; without these controls the observed alignment may be an artifact of capacity or alignment procedure rather than an intrinsic property of neural encoding.
  Authors: We acknowledge that additional controls are necessary to substantiate the claim that fMRI signals align more closely with text embeddings. In the revised version, we will conduct and report experiments that match embedding dimensionalities across spaces, evaluate multiple base models (e.g., different language models and vision transformers), and explicitly describe the projection method used. We will also test non-linear mappings to rule out procedural artifacts. These additions will clarify whether the preference is intrinsic to neural encoding.
  revision: yes
- Referee: [§5] §5 (Experimental Results): the central performance claim of an 8% perceptual-loss reduction is presented without tabulated baselines, statistical significance tests, error bars, or explicit data-exclusion criteria, rendering it impossible to evaluate whether the improvement is robust or dataset-specific.
  Authors: We agree that the experimental results section would benefit from more rigorous reporting. We will revise §5 to include a comprehensive table comparing PRISM against all baselines with mean perceptual loss values, add statistical significance tests (e.g., Wilcoxon signed-rank tests with p-values), include error bars representing standard deviation across subjects or runs in all relevant figures, and explicitly state the data exclusion criteria (such as removing trials with head motion exceeding a threshold). This will allow readers to better assess the robustness of the reported improvements.
  revision: yes
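The Wilcoxon signed-rank test mentioned in the rebuttal can be sketched for a small paired sample as follows. This is an exact two-sided version written with the standard library only, assuming one mean perceptual loss per subject for a baseline and for PRISM; the function names and the loss values are hypothetical, and the enumeration over sign patterns is practical only for small samples (roughly 20 pairs or fewer).

```python
from itertools import product


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(baseline, candidate):
    """Exact two-sided Wilcoxon signed-rank test on paired values.

    Returns (w_plus, p_value), where w_plus sums the ranks of pairs in which
    the baseline value exceeds the candidate value. Zero differences are dropped.
    """
    if len(baseline) != len(candidate):
        raise ValueError("samples must be paired")
    diffs = [b - c for b, c in zip(baseline, candidate) if b != c]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** len(diffs)


# Hypothetical per-subject mean perceptual losses; not results from the paper.
baseline_loss = [0.50, 0.48, 0.55, 0.60, 0.52, 0.47]
prism_loss = [0.46, 0.45, 0.50, 0.58, 0.51, 0.40]

w_plus, p_value = wilcoxon_signed_rank_exact(baseline_loss, prism_loss)
print(w_plus, p_value)  # 21.0 0.03125
```

With six pairs that all favor one method, 0.03125 is the smallest two-sided p-value this test can produce, which is one reason the sample size should be reported alongside the p-value.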
Circularity Check
No significant circularity detected
The paper's derivation rests on two empirical findings (fMRI-text similarity and need for compositional adaptation) that are presented as results from dataset experiments rather than definitions or self-citations. PRISM is then constructed on top of those findings, with performance claims tied to external benchmarks on real-world fMRI datasets. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the central claims to inputs by construction. The work is therefore treated as self-contained against external validation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: fMRI signals can be effectively projected into and aligned with a language-model text latent space
invented entities (2)
- object centric diffusion module: no independent evidence
- attribute relationship search module: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: object-centric diffusion module that generates images by composing individual objects
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.