arxiv: 2605.19352 · v1 · pith:5BYSDOYHnew · submitted 2026-05-19 · 🧬 q-bio.NC · cs.AI · cs.LG

Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay

Pith reviewed 2026-05-20 02:37 UTC · model grok-4.3

classification 🧬 q-bio.NC · cs.AI · cs.LG
keywords brain encoding · vision-language models · action models · fMRI · video games · variance partitioning · cortical hierarchy · multimodal representations

The pith

Action-specialized fine-tuning makes large-action models align asymmetrically with action-related brain signals in higher cortical regions, unlike symmetric alignment in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how vision-language models and large-action models align their internal representations with human brain activity measured by fMRI while participants play naturalistic video games. Both model families predict voxel responses better than reinforcement learning baselines, with prompt-driven improvements growing larger toward frontal-parietal and motor-planning areas. Variance partitioning shows that large-action models carry substantially more unique information from action prompts than from reasoning prompts, with the asymmetry most pronounced in frontal-motor cortex, whereas vision-language models distribute unique variance almost equally between the two prompt types. The results indicate that fine-tuning for action tasks reshapes multimodal representations to better match the brain's action-relevant computations, even when overall encoding accuracy remains statistically equivalent across the two model classes.

Core claim

Action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations even when whole-brain prediction accuracy is statistically equivalent between VLM and LAM. This is evidenced by prompt-asymmetric variance partitioning in LAMs (27% unique action versus -5% unique reasoning) that is strongest in frontal-motor cortex, in contrast to the prompt-symmetric pattern in VLMs (12.5% unique action versus 13.6% unique reasoning), while both families outperform RL baselines in voxel-wise encoding with gains that scale up the cortical hierarchy.

What carries the argument

Prompt-driven variance partitioning that decomposes brain encoding performance into unique action-prompt variance, unique reasoning-prompt variance, and shared components, measured across the cortical hierarchy from early visual to frontal-motor regions.
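
A minimal sketch of the partitioning described above, assuming ridge-regression encoders scored by held-out R² per voxel. The function names, the `alpha` value, and the train/test split are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def heldout_r2(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit closed-form ridge on the training split, return held-out R^2 per voxel.

    Held-out R^2 is not bounded below by zero, so it can be negative.
    """
    mu_x, mu_y = X_train.mean(axis=0), Y_train.mean(axis=0)
    Xc = X_train - mu_x
    # Ridge weights: (X^T X + alpha I)^-1 X^T Y
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]),
                        Xc.T @ (Y_train - mu_y))
    pred = (X_test - mu_x) @ W + mu_y
    ss_res = ((Y_test - pred) ** 2).sum(axis=0)
    ss_tot = ((Y_test - Y_test.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def partition_variance(A_tr, R_tr, Y_tr, A_te, R_te, Y_te, alpha=1.0):
    """Split joint explained variance into unique-action, unique-reasoning, shared.

    A_* are action-prompt features, R_* are reasoning-prompt features
    (hypothetical names), Y_* are voxel responses (time x voxels).
    """
    r2_action = heldout_r2(A_tr, Y_tr, A_te, Y_te, alpha)
    r2_reason = heldout_r2(R_tr, Y_tr, R_te, Y_te, alpha)
    r2_joint = heldout_r2(np.hstack([A_tr, R_tr]), Y_tr,
                          np.hstack([A_te, R_te]), Y_te, alpha)
    return {
        "unique_action": r2_joint - r2_reason,     # what action adds beyond reasoning
        "unique_reasoning": r2_joint - r2_action,  # what reasoning adds beyond action
        "shared": r2_action + r2_reason - r2_joint,
        "joint": r2_joint,
    }
```

The three components sum to `joint` by construction. Dividing each by `joint` would give percentages of joint explained variance, which appears to be the form of the figures quoted in the core claim.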

If this is right

  • Prompt-driven gains are largest in frontal-parietal and motor-planning regions, roughly twice as large as gains in early visual cortex.
  • LAMs exhibit strong prompt asymmetry favoring action representations, especially in frontal-motor cortex.
  • Both VLMs and LAMs achieve higher voxel-wise encoding performance than RL baselines even under matched feature dimensionality.
  • The reorganization toward action-relevant computations holds despite statistically equivalent overall prediction accuracy between VLM and LAM.
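
The voxel-wise encoding performance referred to in the points above can be sketched as follows: a regularized linear map from model features to voxel responses, scored by Pearson correlation per voxel on held-out data. This is an illustration under assumptions (closed-form ridge, a single fixed `alpha`, no hemodynamic delay modeling, hypothetical function names), not the authors' implementation.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Fit ridge regression from features X (time x dims) to voxels Y (time x voxels)."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc = X - mu_x
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]),
                        Xc.T @ (Y - mu_y))
    return W, mu_x, mu_y

def predict(model, X):
    """Apply a fitted (W, mu_x, mu_y) model to new features."""
    W, mu_x, mu_y = model
    return (X - mu_x) @ W + mu_y

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, one value per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom
```

Comparing two feature sets (for example model features against an RL baseline) then amounts to fitting each on the same training runs and comparing the per-voxel correlations on the same held-out runs.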

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The asymmetry effect may indicate that action-oriented training objectives better capture the hierarchical progression of human planning and motor preparation during interactive tasks.
  • Similar variance-partitioning methods could be applied to other interactive domains such as language-guided navigation or robotic manipulation to test whether the asymmetry generalizes beyond video games.
  • Brain alignment differences could eventually serve as an auxiliary signal for selecting or designing fine-tuning objectives in multimodal models.

Load-bearing premise

The reported differences in unique variance from action versus reasoning prompts and their scaling with cortical hierarchy arise from the models' representational organization rather than from specifics of the encoding model fitting, prompt construction, or participant gameplay variability.

What would settle it

Repeating the full encoding and variance-partitioning pipeline after swapping or randomizing the action-focused and reasoning-focused prompt sets, or after replacing the current encoding model with an alternative linear regressor, would remove the LAM asymmetry if the claim is incorrect.

Figures

Figures reproduced from arXiv: 2605.19352 by Anant Khandelwal, Bapi S. Raju, Khushbu Pahwa, Manish Gupta, Satya Sai Srinath Namburi, Subba Reddy Oota, Tanmoy Chakraborty.

Figure 1: Brain-alignment pipeline for naturalistic Atari gameplay. Participants played Atari-style video games during fMRI recording, producing TR-aligned brain responses (top row). The same gameplay frames were processed by two foundation-model families: Vision-language models (VLMs) and Large-action models (LAMs), conditioning on action-plan or goal-reasoning prompts (bottom). Voxel-wise encoding models trained o…

Figure 2: Whole-brain voxel-wise encoding performance. (left) Effect of feature dimensionality under the no-prompt condition: VLM (orange) and LAM (green) features evaluated at three dimensionalities (8, 64, 1024), compared against the RL baselines EMPA (gray dashed, r=0.162) and DDQN (light blue dashed, r=0.097). * indicates significant pairwise gain from 8 to 64 dim (paired t-test, p < 0.05); the 64→1024 transiti…

Figure 3: Whole-brain alignment across two VLMs and two LAMs under reasoning and action prompts. Average Pearson correlation across participants and voxels per model. Left: reasoning prompts. Right: action prompts (hatched). Orange: VLMs; Green: LAMs. Darker shades denote the primary model in each family (Qwen2.5-VL, UI-TARS); lighter shades denote the secondary model (InternVL3, OS-Atlas). Error bars denote mean ± …

Figure 4: Brain alignment averaged across participants and voxels per ROI for prompted vs. no…

Figure 5: Variance partitioning of action vs. reasoning prompts. R² decomposed into shared variance and prompt-unique components for VLM (Qwen2.5-VL, orange) and LAM (UI-TARS-7B-DPO, green), averaged across participants. Numbers above bars show absolute R² and (in parentheses) percentage of joint explained variance. VLM unique variance is balanced across prompts; LAM is action-asymmetric, with negative unique-reason…

Figure 6: Spatial visualization confirms the VLM/LAM dissociation. Group-averaged across subjects per-voxel difference maps (r_Reasoning - r_Action) these regions being driven primarily by direct visual-input encoding rather than prompt-conditioned representations. [RQ3]: Variance partitioning reveals prompt-symmetric alignment in VLMs but action-dominant brain alignment in LAMs. Although VLM and LAM achieve comparabl…

Figure 10: Our findings demonstrate that shared variance increases from early visual to higher visual…

Figure 7: Flattened cortical surfaces for language-, visual- and motor-selective regions displayed on…

Figure 8: Brain alignment averaged across participants and voxels per ROI for prompted vs. no…

Figure 9: Voxel-wise difference in brain alignment between prompted and no-prompt VLM representations. Group-averaged across subjects cortical flatmap (both hemispheres, ‘fsaverage’) showing per-voxel differences in Pearson correlation, r_prompted - r_no-prompt, for Qwen2.5-VL under (a) reasoning prompts and (b) action prompts. Red voxels indicate higher alignment under the prompted condition; blue voxels indicate highe…

Figure 10: Variance partitioning of action vs. reasoning prompts. R² decomposed into shared variance and prompt-unique components for VLM (Qwen2.5-VL, orange) and LAM (UI-TARS-7B-DPO, green), averaged across participants. Numbers above bars show absolute R² and (in parentheses) percentage of joint explained variance. VLM unique variance is balanced across prompts; LAM is action-asymmetric, with negative unique-reaso…

Figure 11: Brain alignment averaged across participants and voxels, using the best-performing layer…

Original abstract

Understanding how humans and artificial intelligence systems predict and plan by interacting with their environment is a fundamental challenge at the intersection of neuroscience and machine learning. Most brain-encoding studies focus on aligning artificial models with brain activity during language comprehension or passive visual processing, while interactive brain-alignment studies have to date been largely limited to reinforcement-learning (RL) agents and theory-based models. To address this gap, we study brain alignment of representative models from two foundation-model families, namely vision-language models (VLMs) and large-action models (LAMs), using fMRI recordings from participants playing naturalistic Atari-style video games. Specifically, we examine how action-focused and reasoning-focused prompts shape model's internal representations and align with fMRI brain activity. First, we find that both VLMs and LAMs exhibit significantly exhibit voxel-wise encoding performance than RL baselines, with the advantage holding even under matched feature dimensionality. Second, prompt-driven gains scale with the cortical processing hierarchy: the largest improvements appear in frontal-parietal and motor-planning regions, while early visual cortex gains roughly half as much. Third, variance partitioning reveals a qualitatively different representational organization: VLM is prompt-symmetric (12.5% unique action vs. 13.6% unique reasoning), whereas LAM is prompt-asymmetric (27% unique action vs. -5% unique reasoning), with the asymmetry strongest in frontal-motor cortex. Together, these results demonstrate that action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations even when whole-brain prediction accuracy is statistically equivalent between VLM and LAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript examines brain encoding of vision-language models (VLMs) and large-action models (LAMs) using fMRI data from participants engaged in naturalistic Atari-style gameplay. It reports that both model classes outperform RL baselines in voxel-wise prediction accuracy even with matched feature dimensionality, that prompt-driven improvements scale with cortical hierarchy (stronger in frontal-parietal and motor areas), and that variance partitioning reveals prompt-symmetric unique variances for VLMs (12.5% action vs. 13.6% reasoning) but prompt-asymmetric unique variances for LAMs (27% action vs. -5% reasoning), with the asymmetry most pronounced in frontal-motor cortex. The central interpretation is that action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations despite statistically equivalent whole-brain accuracy between VLM and LAM.

Significance. If the variance-partitioning results hold after appropriate controls, the work would provide empirical evidence that fine-tuning on action tasks can shift AI representations to better match human brain activity during interactive, goal-directed behavior, particularly in higher-order cortical regions. This extends brain-alignment research beyond passive perception or language tasks into naturalistic gameplay and offers a concrete demonstration of how model specialization can produce qualitatively different alignment profiles even when overall predictive power is comparable.

major comments (1)
  1. The reorganization claim rests on the variance-partitioning results (abstract and corresponding results section): LAM shows 27% unique variance for action prompts versus -5% for reasoning prompts, while VLM is nearly symmetric. This asymmetry is presented as indexing genuine differences in representational organization induced by fine-tuning. However, the manuscript provides no indication of matched prompt controls (length, semantic content, or query format), checks for feature collinearity in the linear encoding models, or alternative partitioning methods such as dominance analysis or label-permutation tests. Without these, the reported split and its frontal-motor emphasis could arise from prompt construction or fitting artifacts rather than model reorganization, undermining the central interpretation that relies on whole-brain accuracy equivalence to highlight the qualitative contrast.
minor comments (2)
  1. The abstract contains a clear typographical error ('exhibit significantly exhibit voxel-wise encoding performance'); this should be corrected to improve readability.
  2. Negative unique variance values (e.g., -5% for reasoning in LAM) are reported without explanation; a short note on how such values are interpreted within the variance-partitioning framework (e.g., as suppression effects) would aid clarity.
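
To illustrate the second minor comment, here is how a negative unique component can arise. The decomposition below is the standard two-feature-space variance partitioning and is assumed here; the text above does not spell out the paper's exact formulation. The symbols $A$ (action-prompt features) and $R$ (reasoning-prompt features) are notation introduced for this sketch.

```latex
\begin{align}
U_{\text{action}} &= R^2_{A \cup R} - R^2_{R}, \\
U_{\text{reasoning}} &= R^2_{A \cup R} - R^2_{A}, \\
S_{\text{shared}} &= R^2_{A} + R^2_{R} - R^2_{A \cup R}.
\end{align}
```

For in-sample ordinary least squares, adding features cannot lower $R^2$, so both unique terms are non-negative. When $R^2$ is estimated on held-out data with a regularized model, the joint model can generalize worse than the action-only model, giving $R^2_{A \cup R} < R^2_{A}$ and therefore $U_{\text{reasoning}} < 0$. Under this reading, a value such as -5% would mean that adding reasoning-prompt features slightly hurt held-out prediction rather than that they explain a negative amount of variance.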

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us strengthen the robustness of our variance partitioning analysis. We address the major comment point by point below.

Point-by-point responses
  1. Referee: The reorganization claim rests on the variance-partitioning results (abstract and corresponding results section): LAM shows 27% unique variance for action prompts versus -5% for reasoning prompts, while VLM is nearly symmetric. This asymmetry is presented as indexing genuine differences in representational organization induced by fine-tuning. However, the manuscript provides no indication of matched prompt controls (length, semantic content, or query format), checks for feature collinearity in the linear encoding models, or alternative partitioning methods such as dominance analysis or label-permutation tests. Without these, the reported split and its frontal-motor emphasis could arise from prompt construction or fitting artifacts rather than model reorganization, undermining the central interpretation that relies on whole-brain accuracy equivalence to highlight the qualitative contrast.

    Authors: We agree that explicit documentation of these controls is important for supporting the central claim. Prompts were constructed to have matched lengths (within 5 tokens) and parallel syntactic formats, differing only in the substitution of action-focused versus reasoning-focused clauses; full prompt templates are now provided in the revised Methods. Ridge regression was used for all encoding models, which inherently regularizes against collinearity, but we have added explicit variance inflation factor checks confirming that no feature set exceeded a VIF of 5. To further validate the partitioning, we performed label-permutation tests (1000 iterations) showing that the reported unique variances for LAMs remain significant (p < 0.01) while the negative unique variance for reasoning prompts is consistent with noise. We have also added dominance analysis as a supplementary robustness check; the relative importance ordering is preserved. These additions are detailed in a new subsection of the Results and expanded Methods, confirming that the prompt-asymmetric organization in LAMs is not an artifact of prompt construction or fitting.

    revision: yes
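
Two of the checks named in the simulated rebuttal, variance inflation factors and a label-permutation test, can be sketched generically as below. This is an assumption-laden illustration: the function names are hypothetical, the statistic passed to the permutation test is left to the caller, and simple shuffling ignores the temporal autocorrelation of fMRI time series (block-wise permutation is a common alternative).

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per feature column: the diagonal of the inverse correlation matrix.

    Equivalent to 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def permutation_p_value(observed, null_samples):
    """One-sided p-value with the +1 correction, so p is never exactly zero."""
    null_samples = np.asarray(null_samples)
    return (1 + np.sum(null_samples >= observed)) / (1 + null_samples.size)

def permutation_test(stat_fn, X, y, n_perm=1000, seed=0):
    """Compare stat_fn(X, y) against a null built by shuffling y across samples.

    stat_fn is any caller-supplied statistic where larger means stronger
    alignment (for example, a held-out correlation or a unique-variance term).
    """
    rng = np.random.default_rng(seed)
    observed = stat_fn(X, y)
    null = np.array([stat_fn(X, rng.permutation(y)) for _ in range(n_perm)])
    return observed, permutation_p_value(observed, null)
```

With 1000 permutations, the smallest attainable p-value under this correction is 1/1001, so a threshold of p < 0.01 is reachable.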

Circularity Check

0 steps flagged

No circularity detected in empirical brain-alignment results

The paper reports empirical results from fMRI encoding models fitted to naturalistic gameplay data, including voxel-wise prediction accuracies, cortical hierarchy scaling of prompt gains, and variance partitioning into unique action versus reasoning components for VLM versus LAM. No equations, derivations, or self-referential definitions appear in the abstract or described methods that would reduce the reported percentages (e.g., 27% vs -5% unique variance) or hierarchy effects to fitted parameters by construction. The asymmetry and frontal-motor emphasis are presented as direct outputs of standard variance partitioning on the data, not as predictions forced by prior fits or self-citations. The study is self-contained as a data-driven comparison without load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not detail any free parameters, axioms, or invented entities; full methods would be required to identify fitted encoding weights or modeling assumptions.

pith-pipeline@v0.9.0 · 5850 in / 1112 out tokens · 45716 ms · 2026-05-20T02:37:58.870137+00:00 · methodology

