arxiv: 2606.20696 · v1 · pith:VEF3OCPQnew · submitted 2026-06-15 · 💻 cs.CL · cs.AI · eess.AS

MindAlign: Decoding Inner Speech from fMRI Signals via Multimodal Embedding Alignment under Limited Data

Pith reviewed 2026-06-27 04:08 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · eess.AS
keywords inner speech decoding · fMRI brain signals · brain-to-text · multimodal embedding alignment · semantic sketch · frozen language model · limited training data · cross-subject generalization

The pith

A two-stage alignment maps fMRI signals to semantic space so a frozen multimodal language model can generate text from inner speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MindAlign as a decoupled framework for decoding inner speech from fMRI under limited data and high inter-subject variability. The first stage builds a subject-specific mapping from neural activity into a shared multimodal semantic space to extract a latent sketch of the intended sentence. The second stage feeds this sketch plus visual context into a frozen multimodal language model to produce free-form text. Experiments on silent image description data show that the method beats fMRI-only and random baselines and that the projection step transfers across subjects. The results support the view that fMRI signals carry semantic modulation that is separable from image priors.

Core claim

MindAlign learns a subject-specific neural-semantic alignment that maps fMRI activity into a shared multimodal semantic space, extracting a latent semantic sketch of the internally generated sentence. This sketch is integrated with visual context to prompt a frozen multimodal language model for free-form generation. The approach outperforms fMRI-only and random baselines on silent image description data and shows that the learned semantic-to-language projection generalizes across subjects when paired with subject-specific neural alignment, indicating that neural signals modulate semantic content beyond image-driven priors.
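
The two stages named in the core claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear aligner, the toy 2-D semantic space, the `concept_bank` embeddings, and the prompt template are all hypothetical. Rendering the sketch as nearest concept words is a simplification chosen for readability (the paper describes a learned semantic-to-language projection), and the frozen model itself is not called here; only the prompt construction is shown.

```python
import math

def linear_align(fmri, weights):
    """Stage 1 (hypothetical): project an fMRI vector into the semantic space."""
    return [sum(w * x for w, x in zip(row, fmri)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_sketch(fmri, weights, concept_bank, top_k=2):
    """Return the top_k concepts whose embeddings are closest to the aligned fMRI."""
    z = linear_align(fmri, weights)
    ranked = sorted(concept_bank,
                    key=lambda name: cosine(z, concept_bank[name]),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(sketch, image_caption):
    """Stage 2 (hypothetical): merge sketch and visual context into a text prompt.
    The language model that would consume this prompt stays frozen."""
    return (f"Image context: {image_caption}\n"
            f"Semantic hints: {', '.join(sketch)}\n"
            "Describe what the person is silently saying about the image.")

# Toy data (made up): 3 voxels mapped into a 2-D semantic space.
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 1.0]]
concept_bank = {"dog": [1.0, 0.0], "park": [0.0, 1.0], "car": [-1.0, 0.0]}
fmri = [0.9, 0.2, 0.1]

sketch = semantic_sketch(fmri, weights, concept_bank)
prompt = build_prompt(sketch, "a photo taken outdoors")
```

Because the language model only ever sees the prompt, everything subject-specific lives in `weights`, which is the property the decoupled design relies on.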

What carries the argument

A decoupled two-stage brain-to-language framework that first performs subject-specific neural-semantic alignment to produce a latent semantic sketch and then integrates it with a frozen multimodal language model.

If this is right

  • The method outperforms fMRI-only and random baselines on fMRI data from silent image description tasks.
  • The learned semantic-to-language projection generalizes across subjects when used with subject-specific neural alignment.
  • Neural signals modulate semantic content beyond what image-driven priors alone would supply.
  • The framework supports open-ended text generation without task-specific fine-tuning of the underlying language model.
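
The cross-subject point in the list above can be illustrated with a toy setup in which each subject has their own aligner but all subjects share one projection into the language model's input space. The matrices, subject IDs, and dimensions below are hypothetical and chosen only so the arithmetic is easy to follow; the paper's actual architecture and dimensions are not given in this summary.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Hypothetical subject-specific aligners. Each maps that subject's voxels
# (a different voxel count per subject) into the same 2-D semantic space.
aligners = {
    "sub-01": [[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]],            # 3 voxels
    "sub-02": [[0.0, 0.5, 0.0, 0.5],
               [1.0, 0.0, 0.0, 0.0]],       # 4 voxels
}

# Hypothetical shared projection from the 2-D semantic space to a 3-D
# language-model input space. It is defined once and reused for every subject.
shared_projection = [[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]]

def decode_embedding(subject_id, fmri):
    sketch = matvec(aligners[subject_id], fmri)   # subject-specific stage
    return matvec(shared_projection, sketch)      # shared stage

# Two subjects with different voxel layouts encoding the same semantic content.
out_1 = decode_embedding("sub-01", [2.0, 1.0, 7.0])
out_2 = decode_embedding("sub-02", [1.0, 3.0, 9.0, 1.0])
```

In this toy case both subjects land on the same sketch, so the shared projection yields identical outputs. Adding a new subject would only require a new entry in `aligners`.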

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of neural alignment from language generation could let the same alignment step pair with different language models as they improve.
  • Cross-subject transfer of the projection step suggests semantic representations extracted from brain signals may have a common structure across individuals.
  • The design may extend to inner speech tasks without visual stimuli if the semantic sketch proves sufficiently independent of image context.

Load-bearing premise

The first-stage alignment extracts a latent semantic sketch from fMRI that is sufficiently independent of image-driven priors for the second stage to generate accurate free-form text from inner speech alone.

What would settle it

An experiment in which text generation accuracy on fMRI from inner speech drops to random baseline levels when visual context is withheld or when the alignment is tested on held-out subjects without retraining.
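
One way to make "random baseline levels" concrete is a permutation test: score each decoded sentence against its own reference, then compare that to the scores obtained when decoded sentences are paired with shuffled references. The sketch below assumes a precomputed similarity matrix `sim[i][j]` between decoded sentence `i` and reference `j`; the matrix values, the function names, and the choice of test are illustrative assumptions, not the paper's evaluation protocol.

```python
import random

def mean_matched(sim, perm):
    """Mean similarity when decoded sentence i is paired with reference perm[i]."""
    return sum(sim[i][perm[i]] for i in range(len(sim))) / len(sim)

def permutation_p_value(sim, n_perm=2000, seed=0):
    """Return (observed score, p-value) against shuffled pairings."""
    rng = random.Random(seed)
    n = len(sim)
    observed = mean_matched(sim, list(range(n)))
    hits = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        if mean_matched(sim, perm) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (1 + hits) / (1 + n_perm)

# Toy similarity matrix (made up): correct pairs score 0.9, mismatches 0.1.
sim = [[0.9 if i == j else 0.1 for j in range(6)] for i in range(6)]
observed, p = permutation_p_value(sim)
```

If decoding without visual context produced a p-value near 1 under a test like this, that would be the kind of collapse to chance the paragraph above describes.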

Figures

Figures reproduced from arXiv: 2606.20696 by Ichiro Kobayashi, Muxuan Liu, Satoshi Nishida.

Figure 1: Overview of MindAlign, our proposed two-stage brain-to-language decoding framework.
Figure 2: Decoded sentence examples (English translations) for two image–fMRI pairs. Colors …
Figure 3: Experimental procedure for the inner speech task.
Figure 4: Overall architecture of our two-stage brain-to-language decoding framework.
Figure 5: Histogram of token length distributions for each participant after LLaVA tokenization, …
Figure 6: Qualitative decoding examples.
Original abstract

Decoding inner speech from non-invasive brain signals remains a fundamental challenge due to the absence of overt linguistic output, limited training data, and large inter-subject variability. Existing brain-to-text approaches often rely on task-specific decoder fine-tuning, which restricts scalability and complicates adaptation to new participants. We propose MindAlign, a decoupled two-stage brain-to-language framework that enables open-ended text generation from fMRI signals without modifying the underlying language model. The first stage learns a subject-specific neural-semantic alignment that maps fMRI activity into a shared multimodal semantic space, extracting a latent semantic sketch of the internally generated sentence. The second stage integrates this sketch with visual context to prompt a frozen multimodal language model for free-form generation. Experiments on fMRI data collected during silent image description demonstrate that the proposed approach consistently outperforms fMRI-only and random baselines. We further show that the learned semantic-to-language projection can generalize across subjects, enabling effective decoding when paired with subject-specific neural alignment. These results indicate that neural signals modulate semantic content beyond image-driven priors, supporting a scalable and modular direction for brain-to-text decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MindAlign, a decoupled two-stage brain-to-text framework for decoding inner speech from fMRI. Stage 1 learns a subject-specific neural-semantic alignment mapping fMRI activity into a shared multimodal embedding space to produce a latent semantic sketch. Stage 2 combines this sketch with visual context to prompt a frozen multimodal language model for free-form text generation. Experiments on silent image-description fMRI data claim consistent outperformance over fMRI-only and random baselines plus cross-subject generalization of the semantic-to-language projection, supporting that neural signals add semantic content beyond image-driven priors.

Significance. If the central empirical claims hold after appropriate controls, the modular design (no LM fine-tuning, subject-specific alignment only) would address key scalability barriers in brain-to-text decoding under limited data and high inter-subject variability, offering a practical path toward open-ended inner-speech decoding.

major comments (2)
  1. [Experiments] Experiments section: no ablation is reported in which the first-stage mapper is trained solely on image embeddings (or image-driven priors) and then compared against the fMRI-derived sketch when both are fed to the frozen MLLM. Without this control, outperformance over the stated baselines and the Abstract claim that “neural signals modulate semantic content beyond image-driven priors” remain indistinguishable from residual image correlation in the fMRI voxels.
  2. [Abstract] Abstract and Experiments: the central claims of “consistent outperformance” and “cross-subject generalization” are stated without any quantitative metrics, dataset sizes, error bars, or baseline-construction details, leaving the load-bearing empirical support unverifiable from the provided text.
minor comments (2)
  1. [Abstract] Abstract does not specify the multimodal embedding space, the exact loss used for alignment, or how visual context is supplied to the MLLM.
  2. [Introduction] Notation for the “latent semantic sketch” and the “semantic-to-language projection” is introduced without an accompanying equation or diagram in the early sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the manuscript to incorporate the suggested controls and clarifications.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: no ablation is reported in which the first-stage mapper is trained solely on image embeddings (or image-driven priors) and then compared against the fMRI-derived sketch when both are fed to the frozen MLLM. Without this control, outperformance over the stated baselines and the Abstract claim that “neural signals modulate semantic content beyond image-driven priors” remain indistinguishable from residual image correlation in the fMRI voxels.

    Authors: We agree that this ablation is required to rigorously support the claim that neural signals contribute semantic content beyond image-driven priors. In the revised manuscript we will add the requested control: the first-stage mapper will be trained on image embeddings alone (derived from the same visual stimuli used in the fMRI sessions) and the resulting sketches will be compared directly against fMRI-derived sketches when both are used to prompt the frozen MLLM. This will quantify any additional benefit provided by the neural data.
    Revision: yes

  2. Referee: [Abstract] Abstract and Experiments: the central claims of “consistent outperformance” and “cross-subject generalization” are stated without any quantitative metrics, dataset sizes, error bars, or baseline-construction details, leaving the load-bearing empirical support unverifiable from the provided text.

    Authors: The Experiments section of the full manuscript already reports dataset sizes (number of subjects and trials), quantitative metrics with error bars obtained via cross-validation, and explicit baseline-construction procedures. To improve verifiability from the abstract itself, we will revise the abstract to include the key numerical results (e.g., relative improvements over baselines and cross-subject generalization performance).
    Revision: yes
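
The control promised in the first response amounts to a paired comparison: the same trials are decoded once from an fMRI-derived sketch and once from an image-only sketch, and the per-trial score difference is reported with an error bar. A paired bootstrap is one common way to do that. The sketch below is an assumption about how such a comparison could be run; the scores are made-up placeholder values, not results from the paper, and `paired_bootstrap_ci` is a hypothetical helper.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Mean paired difference (a - b) with a percentile bootstrap interval."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample trials with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lower, upper

# Placeholder per-trial scores (illustrative only, NOT from the paper).
fmri_sketch_scores = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.67]
image_only_scores = [0.55, 0.57, 0.63, 0.60, 0.58, 0.61, 0.59, 0.62]

mean_diff, lower, upper = paired_bootstrap_ci(fmri_sketch_scores, image_only_scores)
```

An interval that excludes zero would indicate that the fMRI-derived sketch adds something beyond the image-only sketch on these trials. An interval that straddles zero would leave the "beyond image-driven priors" claim unsupported.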

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The abstract and available description present a two-stage empirical framework (neural-semantic alignment followed by prompting a frozen MLLM) whose performance claims rest on outperforming fMRI-only and random baselines in silent image-description fMRI data. No equations, parameter-fitting procedures, self-citations, or uniqueness theorems are referenced that would reduce any claimed prediction or result to an input quantity by construction. The central claim that neural signals modulate semantic content beyond image priors is presented as an empirical observation rather than a definitional or fitted tautology, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of a shared multimodal semantic space that can be aligned to fMRI and the assumption that subject-specific mappings can be learned from limited data without contaminating the semantic sketch with image priors. No free parameters or invented entities are explicitly named in the abstract.

axioms (1)
  • domain assumption: A shared multimodal semantic space exists that can align fMRI activity, language, and visual context.
    Invoked in the description of the first stage that maps fMRI into this space.

pith-pipeline@v0.9.1-grok · 5731 in / 1270 out tokens · 26012 ms · 2026-06-27T04:08:47.798475+00:00 · methodology

