EmoMind: Decoding Affective Captions from Human Brain fMRI
Pith reviewed 2026-05-19 20:19 UTC · model grok-4.3
The pith
EmoMind decodes continuous 34-dimensional affect from fMRI to rewrite neutral scene descriptions into subject-specific affective captions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EmoMind retrieves a semantically grounded neutral scene description from brain-decoded visual features, then rewrites it with a continuous 34-dimensional emotion vector decoded from the same fMRI recording. The rewriter is trained with classifier-free guidance against an identity-preserving null branch, which allows controllable interpolation between semantic fidelity and affective expressivity. The resulting captions are reported to outperform label-prompted GPT-4 on subject-specificity, structural geometry, and causal control across two independent emotion fMRI datasets.
What carries the argument
The continuous 34-dimensional emotion vector decoded from fMRI, used inside a rewriter trained with classifier-free guidance against an identity-preserving null branch to modify neutral scene descriptions.
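As a rough illustration of how classifier-free guidance can trade content against affect, the sketch below blends next-token logits from a null (identity-preserving) branch and an emotion-conditioned branch. The function name `cfg_logits`, the guidance weight `w`, and the toy logits are assumptions made for this example; the text reviewed here does not give the paper's actual rewriter or guidance formula.

```python
import numpy as np

def cfg_logits(null_logits, cond_logits, w):
    """Classifier-free guidance in logit space (illustrative only).

    w = 0 returns the null-branch logits (content preserved),
    w = 1 returns the emotion-conditioned logits,
    w > 1 extrapolates past the conditioned branch.
    """
    null_logits = np.asarray(null_logits, dtype=float)
    cond_logits = np.asarray(cond_logits, dtype=float)
    return null_logits + w * (cond_logits - null_logits)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary of three tokens; both logit vectors are hypothetical.
null = np.array([2.0, 0.5, 0.1])   # null branch favours the neutral token
cond = np.array([1.0, 2.0, 0.3])   # conditioned branch favours an affective token

for w in (0.0, 0.5, 1.0, 2.0):
    print(w, np.round(softmax(cfg_logits(null, cond, w)), 3))
```

Sweeping `w` from 0 upward moves probability mass from the neutral token toward the affective one, which is the kind of interpolation the claim depends on.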
If this is right
- Continuous affect decoded from fMRI functions as a usable control signal for generating captions that reflect individual emotional responses rather than averaged categories.
- The three-axis validation framework measures subject-specific affective structure, structural geometry, and causal control in brain-to-text systems.
- A synthetic-brain substitution test checks whether the pipeline remains stable when the measurement apparatus changes.
- The largest performance gains appear on metrics that require person-specific affective structure instead of population-level aggregation.
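The text available here does not define how the structural geometry axis is scored. One common way to compare representational geometry is representational similarity analysis (RSA), sketched below under that assumption. All function names and the synthetic data are hypothetical and stand in for decoded emotion vectors and caption embeddings.

```python
import numpy as np

def rdm(X):
    """Representational dissimilarity matrix: 1 - Pearson r between rows."""
    return 1.0 - np.corrcoef(X)

def upper(M):
    """Off-diagonal upper-triangle entries as a flat vector."""
    i, j = np.triu_indices(M.shape[0], k=1)
    return M[i, j]

def rankdata(v):
    """Ranks 1..n; ties are not averaged, which is fine for continuous toy data."""
    order = np.argsort(v)
    ranks = np.empty(len(v), dtype=float)
    ranks[order] = np.arange(1, len(v) + 1)
    return ranks

def rsa_score(A, B):
    """Spearman correlation between the RDMs of two feature sets (rows = stimuli)."""
    a, b = rankdata(upper(rdm(A))), rankdata(upper(rdm(B)))
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
emotion = rng.normal(size=(20, 34))                             # 20 stimuli x 34 affect dimensions
captions_noisy = emotion + 0.1 * rng.normal(size=emotion.shape)  # embeddings that track affect
captions_rand = rng.normal(size=(20, 16))                        # embeddings unrelated to affect

print(rsa_score(emotion, captions_noisy), rsa_score(emotion, captions_rand))
```

Because RSA compares stimulus-by-stimulus dissimilarities, the two feature sets do not need the same dimensionality, only the same stimuli in the same order.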
Where Pith is reading between the lines
- The same continuous-vector approach could be tested for generating other personalized outputs such as image edits or dialogue responses that match a user's current brain state.
- The interpolation mechanism might be adapted to other brain-decoding tasks where a user wants adjustable levels of semantic versus stylistic control.
- The two-dataset result suggests the method could be applied to study how affective organisation differs across individuals in larger populations.
Load-bearing premise
That a 34-dimensional continuous vector extracted from fMRI accurately captures rich inter-subject affective variability and that classifier-free guidance produces smooth, artifact-free control over the balance of content and emotion.
What would settle it
A direct comparison in which EmoMind-generated captions show no higher correlation with individual subjects' fMRI patterns than captions produced from population-averaged emotion labels would falsify the claimed advantage in person-specific structure.
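A comparison of this kind could be run as a paired test over subjects. The sketch below uses a sign-flip permutation test on per-subject scores. The scores are made-up numbers for illustration, not results from the paper, and the function name and the choice of a one-sided test are assumptions.

```python
import numpy as np

def paired_permutation_test(a, b, n_perm=10000, seed=0):
    """One-sided paired sign-flip test of mean(a - b) > 0."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = d.mean()
    rng = np.random.default_rng(seed)
    # Under the null, each subject's difference is equally likely to be + or -.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (signs * d).mean(axis=1)
    # Small tolerance guards against floating-point ties;
    # the add-one correction keeps the p-value strictly positive.
    p = (1 + np.sum(null >= observed - 1e-12)) / (n_perm + 1)
    return observed, p

# Hypothetical per-subject scores: correlation between each caption's affect
# profile and that subject's own fMRI-derived emotion pattern.
emomind = [0.41, 0.38, 0.45, 0.36, 0.40, 0.43, 0.39, 0.44]
averaged = [0.30, 0.33, 0.31, 0.35, 0.29, 0.34, 0.32, 0.30]

diff, p = paired_permutation_test(emomind, averaged)
print(f"mean difference = {diff:.3f}, p = {p:.4f}")
```

If the mean difference were indistinguishable from zero under this test, that would be the null outcome described above.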
Original abstract
Decoding visual experience from brain activity has advanced substantially, but current brain-to-text systems largely recover semantic content while discarding affect. Additionally, language models can generate emotional text when prompted with categorical labels, but such labels collapse rich inter-subject variability into coarse discrete bins. We present EmoMind, the first end-to-end pipeline for decoding affective captions directly from fMRI signals. EmoMind first retrieves a semantically grounded neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector decoded from the same fMRI recording. To control the balance between content preservation and affective expression, we train the rewriter with classifier-free guidance against an identity-preserving null branch, enabling smooth interpolation between semantic fidelity and affective expressivity. We evaluate affective caption generation with a three-axis validation framework spanning subject-specificity, structural geometry, and causal control. We further augment this framework with a synthetic-brain substitution test that probes robustness to the measurement apparatus, and we benchmark each axis against GPT-4 prompted with brain-decoded top-5 emotion labels as a strong discrete baseline. Across two independent emotion fMRI datasets, EmoMind significantly outperforms label-prompted GPT-4 on all three axes, with the largest gains on metrics that require person-specific affective structure rather than population-level emotion aggregation. These results establish continuous brain-decoded affect as a viable control signal for individualized affective caption generation and open new directions for studying individual affective brain organisation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EmoMind, an end-to-end pipeline for decoding affective captions directly from fMRI signals. It first retrieves a semantically grounded neutral scene description from brain-decoded visual features, then rewrites the description using a continuous 34-dimensional emotion vector decoded from the same fMRI recording. Classifier-free guidance is applied against an identity-preserving null branch to interpolate between semantic fidelity and affective expressivity. Evaluation uses a three-axis framework (subject-specificity, structural geometry, causal control) plus a synthetic-brain substitution test, with benchmarks against GPT-4 prompted by brain-decoded top-5 emotion labels. The paper claims significant outperformance over this baseline across two independent emotion fMRI datasets, with largest gains on metrics requiring person-specific affective structure.
Significance. If the quantitative results and controls hold, this would constitute a meaningful advance in brain-to-text decoding by incorporating continuous, subject-specific affective signals rather than discrete categorical labels. It directly addresses the limitation that current systems discard affect and could enable more individualized affective caption generation while providing a new framework for studying inter-subject variability in affective brain organization.
major comments (2)
- [Abstract and §4] Abstract and §4 (Evaluation framework): The central claim of significant outperformance on all three axes lacks any reported quantitative metrics, error bars, statistical tests, or data exclusion criteria in the provided text. Without these, the reader cannot assess effect sizes or verify that gains are driven by person-specific structure rather than population-level aggregation.
- [Methods] Methods (emotion vector decoding): The 34-dimensional emotion vector is presented as decoded from fMRI, yet no derivation, fitting procedure, or validation against ground-truth affective variability is shown. This is load-bearing for the claim that it captures rich inter-subject differences and enables the reported gains over discrete labels.
minor comments (2)
- [Methods] Clarify the precise implementation of classifier-free guidance, including the identity-preserving null branch and how interpolation avoids artifacts.
- [Evaluation] Add explicit description of the synthetic-brain substitution test procedure and the specific robustness properties it measures.
Simulated Author's Rebuttal
We are grateful to the referee for their detailed and constructive review of our manuscript. Their comments highlight important areas for clarification and improvement in the presentation of our results and methods. Below, we provide point-by-point responses to the major comments and outline the revisions we plan to make.
Referee: [Abstract and §4] Abstract and §4 (Evaluation framework): The central claim of significant outperformance on all three axes lacks any reported quantitative metrics, error bars, statistical tests, or data exclusion criteria in the provided text. Without these, the reader cannot assess effect sizes or verify that gains are driven by person-specific structure rather than population-level aggregation.
Authors: We acknowledge that the abstract and §4 currently emphasize the outperformance claims without accompanying quantitative details such as specific metric values, error bars, or statistical tests. This omission makes it difficult for readers to fully evaluate the effect sizes and the contribution of person-specific structure. In the revised version, we will update §4 to include full reporting of all evaluation metrics with means ± standard deviations, 95% confidence intervals or error bars in visualizations, p-values from appropriate statistical tests (e.g., paired t-tests or permutation tests), and clear criteria for data exclusion (such as motion thresholds or signal quality checks). We will also add a new table summarizing the comparative results against the GPT-4 baseline to directly address concerns about population-level vs. subject-specific gains. revision: yes
Referee: [Methods] Methods (emotion vector decoding): The 34-dimensional emotion vector is presented as decoded from fMRI, yet no derivation, fitting procedure, or validation against ground-truth affective variability is shown. This is load-bearing for the claim that it captures rich inter-subject differences and enables the reported gains over discrete labels.
Authors: The referee is correct that the current Methods section does not provide sufficient detail on how the 34-dimensional emotion vector is derived from fMRI signals. To address this, we will expand the Methods with a new subsection titled 'Emotion Vector Decoding' that describes: (1) the source of the 34D vector (e.g., from validated affective rating scales like the Self-Assessment Manikin or similar), (2) the decoding model architecture and training procedure (e.g., voxel-wise linear regression with regularization, trained on subject-specific fMRI data), (3) the cross-validation scheme used for fitting, and (4) validation results showing correlation coefficients or prediction accuracy against ground-truth affective variability from independent ratings. This will demonstrate how the continuous vector preserves inter-subject differences that discrete labels cannot capture. revision: yes
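The rebuttal mentions regularized voxel-wise linear regression with cross-validation as one possible decoder. The sketch below shows what that could look like with closed-form ridge regression and k-fold validation on synthetic data. The array sizes, the `alpha` value, and the function names are assumptions for illustration; none of this is taken from the paper's Methods.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights: solve (X^T X + alpha I) W = X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def cv_dimension_correlations(X, Y, alpha=10.0, n_folds=5, seed=0):
    """Per-dimension Pearson r between held-out predictions and targets."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    pred = np.zeros_like(Y, dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # Centre with training-set means only, to avoid leaking test data.
        mu_x, mu_y = X[train_idx].mean(0), Y[train_idx].mean(0)
        W = ridge_fit(X[train_idx] - mu_x, Y[train_idx] - mu_y, alpha)
        pred[test_idx] = (X[test_idx] - mu_x) @ W + mu_y
    return np.array([np.corrcoef(pred[:, k], Y[:, k])[0, 1]
                     for k in range(Y.shape[1])])

# Synthetic stand-in for one subject: trials x voxels and trials x 34 ratings.
rng = np.random.default_rng(1)
n_trials, n_voxels, n_dims = 200, 50, 34
X = rng.normal(size=(n_trials, n_voxels))
W_true = rng.normal(size=(n_voxels, n_dims))
Y = X @ W_true + rng.normal(size=(n_trials, n_dims))

r = cv_dimension_correlations(X, Y)
print(r.shape, float(r.mean()))
```

Reporting the held-out correlation for each of the 34 dimensions, per subject, is the sort of validation the referee asks for; real fMRI data would give far lower values than this noise-light toy example.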
Circularity Check
No significant circularity detected in derivation chain
The abstract and available text describe an end-to-end pipeline that decodes a 34-dimensional emotion vector from fMRI to rewrite neutral scene descriptions, with classifier-free guidance for control. No equations, fitting procedures, or self-citations are shown that would reduce any claimed prediction, uniqueness, or result to its own inputs by construction. The three-axis validation framework and GPT-4 baseline are presented as external benchmarks, and the central claims rely on decoded signals rather than self-definitional or fitted-input reductions. The derivation is therefore self-contained against the described external evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- 34-dimensional emotion vector
axioms (1)
- domain assumption: fMRI signals contain sufficient information to decode a continuous 34D affective representation that generalizes across subjects for caption rewriting