pith. sign in

arxiv: 2506.08277 · v3 · pith:YGUE7VIUnew · submitted 2025-06-09 · 🧬 q-bio.NC · cs.AI· cs.CL· cs.CV· cs.LG

Task-conditioned probing of instruction-tuned multimodal LLMs: Region-specific brain alignment patterns under naturalistic stimuli

Pith reviewed 2026-05-21 23:58 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.AIcs.CLcs.CVcs.LG
keywords instruction-tuned MLLMsbrain alignmentfMRInaturalistic stimulitask conditioningmultimodal LLMsvideo encodingregion-specific patterns
0
0 comments X

The pith

Instruction-tuned multimodal LLMs produce task-specific representations that align more strongly with brain activity than other models during naturalistic movie watching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether instruction-tuning in multimodal large language models leads to representations organized around task demands that match human brain activity. By predicting fMRI signals from participants watching movies with audio using embeddings from various MLLMs under different instructions, it finds that instruction-tuned video models outperform others. This suggests that task conditioning helps these models capture functional aspects similar to brain processing rather than just surface-level semantics. The results show variations in alignment across brain regions and differences in semantic coupling between model types.

Core claim

The authors demonstrate that instruction-tuned video MLLMs, when using instruction-specific embeddings across 13 video task instructions, achieve higher prediction accuracy for fMRI responses recorded during naturalistic movie watching compared to ICL multimodal models by approximately 9%, non-instruction-tuned multimodal models by 15%, and unimodal baselines by 20%. They observe that these models generate distinct task-specific representations that vary across brain regions, with IT models showing weak coupling to instruction-text semantics unlike ICL models which show strong semantic organization.

What carries the argument

Instruction-specific embeddings from instruction-tuned MLLMs under task-conditioned probing to predict voxel-wise fMRI responses.

If this is right

  • Task-conditioned subspaces in IT models associate with higher brain alignment.
  • IT models show weak coupling to instruction-text semantics (r=0.14) compared to strong coupling in ICL models (r=0.78).
  • Distinct task-specific representations emerge across brain regions for video and audio tasks.
  • Findings indicate an association between task-specific instructions and stronger brain-MLLM alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Task conditioning in AI models could be used to create systems that more closely resemble human brain information processing.
  • This approach might allow better mapping of how the brain integrates multimodal information in real-world scenarios.
  • Testing the method on non-movie naturalistic stimuli could reveal if the advantages are specific to certain types of experiences.

Load-bearing premise

Differences in alignment arise specifically from instruction-tuning and task conditioning rather than from uncontrolled differences in model scale, pretraining data, or layer selection.

What would settle it

A direct comparison of instruction-tuned and non-tuned models matched for scale, pretraining data, and architecture that shows no alignment difference would falsify the role of instruction-tuning.

Figures

Figures reproduced from arXiv: 2506.08277 by Bapi S. Raju, Khushbu Pahwa, Maneesh Singh, Manish Gupta, Prachi Jindal, Satya Sai Srinath Namburi, Subba Reddy Oota, Tanmoy Chakraborty.

Figure 1
Figure 1. Figure 1: Leveraging instruction-tuned multimodal video and audio models for brain encoding [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Average normalized brain alignment of instruction-tuned video MLLMs vs instruction [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Each voxel is color-coded with the instruction that led to the highest normalized brain [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) Qwen-2.5-VL and (b) Qwen-2.5-Audio (layer-wise alignment): Each voxel is color [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Semantic Task Group Analysis: Each voxel is color coded with the task instruction that led [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Flattened cortical surfaces for language-, visual- and auditory-selective regions displayed [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Estimated cross-subject prediction accuracy for all four participants for the naturalistic [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Average normalized brain alignment of instruction-tuned video MLLMs vs instruction [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Average normalized brain alignment of instruction-tuned video MLLMs vs instruction [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qwen-2.5-VL vs. TVLT: Contrast of estimated cross-subject prediction accuracy for [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: InstructBLIPVideo vs. TVLT: Contrast of estimated cross-subject prediction accuracy [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Video-LLaVA vs. TVLT: Contrast of estimated cross-subject prediction accuracy for [PITH_FULL_IMAGE:figures/full_fig_p029_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: LLaVA-NeXT-Video vs. TVLT: Contrast of estimated cross-subject prediction accuracy [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Qwen-Audio vs. TVLT: Contrast of estimated cross-subject prediction accuracy for [PITH_FULL_IMAGE:figures/full_fig_p031_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Kimi-Audio vs. TVLT: Contrast of estimated cross-subject prediction accuracy for [PITH_FULL_IMAGE:figures/full_fig_p032_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Each voxel is color coded with the instruction (out of 13) that led to the highest nor [PITH_FULL_IMAGE:figures/full_fig_p033_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Each voxel is color coded with the instruction (out of 13) that led to the highest normalized [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Kimi-Audio: Each voxel is color-coded with the instruction (out of 5) that led to the [PITH_FULL_IMAGE:figures/full_fig_p035_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Each voxel is color coded with the video MLLM layer number (out of 33) that led to the [PITH_FULL_IMAGE:figures/full_fig_p036_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Semantic Task Group Analysis: Each voxel is color coded with the task instruction that led [PITH_FULL_IMAGE:figures/full_fig_p037_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Share variance of video tasks: The voxels are projected onto the flattened cortical surface [PITH_FULL_IMAGE:figures/full_fig_p037_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Shared and Unique Variance: Narrative Understanding vs. Linking Events Dark orange [PITH_FULL_IMAGE:figures/full_fig_p038_22.png] view at source ↗
read the original abstract

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction-tuning is associated with IT-MLLMs organizing their representations around functional task demands or if they simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs show higher brain alignment than in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, and language-guided probing produces distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that instruction-tuned video MLLMs produce task-conditioned representations that align more strongly with voxel-wise fMRI responses recorded during naturalistic movie watching (video+audio) than ICL multimodal models (~9% higher), non-instruction-tuned multimodal models (~15%), or unimodal baselines (~20%). Using instruction-specific embeddings from six video and two audio IT-MLLMs across 13 task instructions, it reports that IT models exhibit weak coupling to instruction-text semantics (r=0.14) while ICL models show strong semantic organization (r=0.78), interpreting the pattern as evidence that task subspaces rather than surface semantics drive the alignment advantage. Code is released publicly.

Significance. If the central attribution to instruction-tuning survives controls, the result would indicate that IT-MLLMs organize representations around functional task demands in a manner that better matches human brain activity under naturalistic conditions, extending prior multimodal brain-encoding work and suggesting a concrete link between tuning objectives and cross-system information processing. Public code is a clear reproducibility strength.

major comments (2)
  1. [Results and Methods] Results (quantitative gains paragraph) and Methods (model comparison setup): the ~9–20% alignment advantages are reported without matching or covariate adjustment on parameter count, pretraining corpus size/composition, or the precise layers used for embedding extraction across IT-MLLMs versus ICL/non-IT/unimodal baselines. This is load-bearing for the claim that differences arise specifically from instruction-tuning and task conditioning rather than scale or data confounds.
  2. [Evaluation] Evaluation section (task-specific embeddings and probing): instruction-specific embeddings are used only for the IT models, yet no ablation or layer-wise analysis is described that isolates the contribution of tuning from the choice of embedding source or from uncontrolled architectural differences. Without this, the task-conditioned subspace interpretation cannot be cleanly separated from the listed confounds.
minor comments (2)
  1. [Abstract] Abstract: specify the exact statistical tests, correction procedures, and number of subjects/voxels underlying the reported percentage gains and correlation values (r=0.78, r=0.14).
  2. [Methods] Throughout: provide a supplementary table listing all compared models with parameter counts, layer indices, and pretraining details to support the comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the attribution of our findings to instruction-tuning. We address each major comment below with specific plans for revision where appropriate.

read point-by-point responses
  1. Referee: [Results and Methods] Results (quantitative gains paragraph) and Methods (model comparison setup): the ~9–20% alignment advantages are reported without matching or covariate adjustment on parameter count, pretraining corpus size/composition, or the precise layers used for embedding extraction across IT-MLLMs versus ICL/non-IT/unimodal baselines. This is load-bearing for the claim that differences arise specifically from instruction-tuning and task conditioning rather than scale or data confounds.

    Authors: We agree that explicit controls for scale and data composition would strengthen causal attribution to instruction-tuning. Our design compared multiple representative models per category (six IT video MLLMs plus two audio IT-MLLMs against ICL, non-IT multimodal, and unimodal baselines) to reduce reliance on any single model, and the alignment advantage was consistent across this set. However, we did not perform formal matching or regression-based covariate adjustment. In revision we will add a supplementary table listing approximate parameter counts, known pretraining corpus characteristics, and the exact layer(s) chosen for each model, with justification drawn from prior brain-encoding literature on intermediate-layer optimality. We will also expand the Limitations section to note that residual scale confounds cannot be fully ruled out without larger-scale controlled experiments. revision: yes

  2. Referee: [Evaluation] Evaluation section (task-specific embeddings and probing): instruction-specific embeddings are used only for the IT models, yet no ablation or layer-wise analysis is described that isolates the contribution of tuning from the choice of embedding source or from uncontrolled architectural differences. Without this, the task-conditioned subspace interpretation cannot be cleanly separated from the listed confounds.

    Authors: Instruction-specific embeddings are the natural input for IT models given their training objective; for ICL and non-IT models we followed standard embedding protocols appropriate to each architecture, as stated in Methods. We acknowledge the absence of a dedicated ablation that fully disentangles instruction-tuning from architectural variation. In the revised manuscript we will include a new layer-wise analysis for a representative subset of models (one IT, one ICL, one non-IT) showing that the alignment advantage persists across multiple layers rather than being confined to a single extraction point. We will also emphasize that the observed dissociation in semantic coupling (r=0.14 for IT vs. r=0.78 for ICL) provides convergent evidence that the alignment gain is not driven by surface-level instruction semantics alone. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical comparisons of brain alignment are self-contained

full rationale

The paper reports direct empirical measurements of voxel-wise fMRI prediction accuracy using embeddings extracted from various MLLM classes under naturalistic video stimuli. The central claim of higher alignment (~9-20%) for instruction-tuned models is presented as an observed outcome from cross-model comparisons, not as a quantity derived from an equation or parameter that is itself defined by the alignment result. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; the evaluation relies on external model families and standard correlation metrics without reducing the task-conditioned advantage to a tautology. The derivation chain consists of standard encoding-model fitting followed by group-level statistical comparison and is therefore independent of the reported findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard domain assumptions from brain encoding literature rather than new free parameters or invented entities introduced by this work.

axioms (1)
  • domain assumption Linear regression models suffice to map MLLM embeddings to voxel-wise fMRI responses.
    Implicit in the estimation of brain alignment by predicting fMRI responses from model representations.

pith-pipeline@v0.9.0 · 5892 in / 1221 out tokens · 64664 ms · 2026-05-21T23:58:37.428002+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 6 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, August

  2. [2]

    Mae-ast: Masked autoencoding audio spectrogram transformer

    Alan Baade, Puyuan Peng, and David Harwath. Mae-ast: Masked autoencoding audio spectrogram transformer. Interspeech 2022,

  3. [3]

    Qwen2-Audio Technical Report

    URL https://publications.polymtl.ca/50613/. 10 Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759,

  4. [4]

    What can 1.8 billion regressions tell us about the pressures shaping high-level visual representation in brains and machines? bioRxiv, pp

    Colin Conwell, Jacob S Prince, Kendrick N Kay, George A Alvarez, and Talia Konkle. What can 1.8 billion regressions tell us about the pressures shaping high-level visual representation in brains and machines? bioRxiv, pp. 2022–03,

  5. [5]

    Visual representations in the human brain are aligned with large language models

    Adrien Doerig, Tim C Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, and Ian Charest. Semantic scene descriptions as an objective of human vision. arXiv preprint arXiv:2209.11737,

  6. [6]

    Interpreting multimodal video transformers using brain recordings

    Dota Tianai Dong and Mariya Toneva. Interpreting multimodal video transformers using brain recordings. In ICLR 2023 Workshop on Multimodal Representation Learning: Perks and Pitfalls, 2023a. Dota Tianai Dong and Mariya Toneva. Vision-language integration in multimodal video transformers (partially) aligns with the brain. arXiv preprint arXiv:2311.07766, 2...

  7. [7]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

  8. [8]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958,

  9. [9]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 5971–5984,

  10. [10]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024),

  11. [11]

    Yuko Nakagi, Takuya Matsuyama, Naoko Koide-Majima, Hiroto Yamaguchi, Rieko Kubo, Shinji Nishimoto, and Yu Takagi

    URL https://openreview.net/forum?id=KL8Sm4xRn7. Yuko Nakagi, Takuya Matsuyama, Naoko Koide-Majima, Hiroto Yamaguchi, Rieko Kubo, Shinji Nishimoto, and Yu Takagi. Unveiling multi-level and multi-modal semantic representations in the human brain using large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Proce...

  12. [12]

    The cost of compression: Investigating the impact of compression on parametric knowledge in language models

    Satya Sai Srinath Namburi, Makesh Sreedhar, Srinath Srinivasan, and Frederic Sala. The cost of compression: Investigating the impact of compression on parametric knowledge in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December

  13. [13]

    URL https://aclanthology

    Association for Computational Linguistics. URL https://aclanthology. org/2023.findings-emnlp.349/. Subba Reddy Oota, Jashn Arora, Veeral Agarwal, Mounika Marreddy, Manish Gupta, and Bapi Surampudi. Neural language taskonomy: Which nlp tasks are the most predictive of fmri brain activity? In Proceedings of the 2022 Conference of the North American Chapter ...

  14. [14]

    Tuning in to neural encoding: Linking human brain and artificial supervised representations of language

    Jingyuan Sun, Xiaohan Zhang, and Marie-Francine Moens. Tuning in to neural encoding: Linking human brain and artificial supervised representations of language. In ECAI 2023, pp. 2258–2265. IOS Press,

  15. [15]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

  16. [16]

    Brain- wavlm: Fine-tuning speech representations with brain responses to language

    Nishitha Vattikonda, Aditya R Vaidya, Richard J Antonello, and Alexander G Huth. Brain- wavlm: Fine-tuning speech representations with brain responses to language. arXiv preprint arXiv:2502.08866,

  17. [17]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,

  18. [18]

    URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/ . 14 Overview of Appendix Sections • Appendix A: Overview of multimodal model evaluation settings in brain encoding studies • Appendix B: Related work • Appendix C: Detailed sub-ROIs of language, visual and auditory regions • Appendix D: Cross-subject prediction accuracy • Appendix E: Model ...

  19. [19]

    to predict one participant’s response from others. For every combination, one participant was randomly chosen as the target, and the model was trained to predict their brain responses using data from the remaining s − 1 participants. This gave us an average prediction score (correlation) for each voxel at each participant. To extrapolate to infinitely man...

  20. [20]

    The Wolf of Wall Street,

    Characters: - The character on the left is wearing a suit with a patterned tie and is holding a glass in his hand. Video reasoning The video appears to be from a scene in a movie or TV show, featuring two characters engaged in a conversation. The setting looks like a bar or a similar social environment, with dim lighting and a relaxed atmosphere. What mig...

  21. [21]

    Goodfellas

    is used for all the tests (appropriate because fMRI data is considered to have positive dependence (Genovese, 2000)). H Effectiveness of instruction-tuned video MLLMs vs audio MLLMs vs multimodal vs unimodal representations for various brain regions Fig. 8 show average normalized brain alignment of instruction-tuned video MLLMs vs instruction- tuned audio...

  22. [22]

    The Hangover,

    Engagement: The character might be trying to show interest or engagement in the conversation by leaning closer. Spatial Understanding The video appears to be from a movie or TV show set in a bar or restaurant. The setting includes a bar counter with bottles and glasses, suggesting it could be a scene from a film or series that takes place in a social or d...