BrainJanus: A Unified Model for Understanding and Generation across Brain, Vision, and Language

Changqing Zhang; Chi Zhang; Chunfeng Song; Haitao Wu; Jiamin Wu; Mianxin Liu; Qihao Zheng; Qirui Zhang; Shangquan Sun; Wanli Ouyang; Zhouheng Yao

arxiv: 2606.30319 · v1 · pith:2IBHUE4Ynew · submitted 2026-06-29 · 💻 cs.CV · cs.LG


Pith reviewed 2026-06-30 06:42 UTC · model grok-4.3

keywords: brain modeling, multimodal integration, neural tokenization, autoregressive generation, brain encoding, brain decoding, vision language models

The pith

BrainJanus turns brain signals into tokens shared with images and text for any-direction translation in one model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that brain encoding and decoding work better when treated as parts of a single multimodal system rather than isolated tasks. It does this by quantizing raw brain activity into discrete tokens that sit in the same space as tokens from vision and language. An autoregressive next-token model then generates outputs in any direction, from brain to image or text to brain. A sympathetic reader would care because this matches the brain's own role as an integrator across senses and could reduce the need for separate models and external priors for each mapping.

Core claim

BrainJanus is the first unified brain model that integrates brain, vision, and language within a single framework. It introduces a Unified Brain Tokenizer to quantize continuous neural dynamics into discrete tokens aligned with visual and linguistic representations in a shared Omni space. Building on this, an All-in-One autoregressive architecture uses next-token prediction to enable seamless any-to-any generation, which encompasses image-to-brain and text-to-brain encoding, and brain-to-image and brain-to-text decoding. The framework achieves superior performance across diverse benchmarks, exhibits zero-shot generalization, and preserves interpretable biological topography.
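To make the any-to-any idea concrete, here is a minimal sketch of how tokens from three modalities could share one vocabulary and be arranged so that a single next-token model serves every direction. The vocabulary sizes, the marker tokens, and every function name below are illustrative assumptions, not the paper's configuration or code.

```python
from dataclasses import dataclass

# Hypothetical vocabulary sizes, chosen only for illustration.
VOCAB_SIZES = {"text": 32000, "image": 16384, "brain": 128}


@dataclass(frozen=True)
class Modality:
    name: str
    offset: int  # first shared-vocabulary id used by this modality
    size: int    # number of content tokens in this modality
    bos: int     # shared id marking the start of a segment
    eos: int     # shared id marking the end of a segment


def build_modalities(vocab_sizes):
    """Lay the modalities side by side in one vocabulary, then add markers."""
    offsets, next_id = {}, 0
    for name, size in vocab_sizes.items():
        offsets[name] = next_id
        next_id += size
    modalities = {}
    for name, size in vocab_sizes.items():
        modalities[name] = Modality(name, offsets[name], size, next_id, next_id + 1)
        next_id += 2
    return modalities, next_id  # next_id is now the total vocabulary size


def to_shared(modality, local_ids):
    """Map modality-local token ids into the shared id range."""
    for i in local_ids:
        if not 0 <= i < modality.size:
            raise ValueError(f"{i} is outside the {modality.name} vocabulary")
    return [modality.offset + i for i in local_ids]


def build_sequence(modalities, source, source_ids, target, target_ids=None):
    """Return (sequence, loss_mask) for one source-to-target example.

    With target_ids=None the sequence stops after the target start marker,
    which is the prompt a next-token model would continue at inference time.
    """
    src, tgt = modalities[source], modalities[target]
    sequence = [src.bos, *to_shared(src, source_ids), src.eos, tgt.bos]
    loss_mask = [0] * len(sequence)  # no loss on the conditioning part
    if target_ids is not None:
        completion = [*to_shared(tgt, target_ids), tgt.eos]
        sequence += completion
        loss_mask += [1] * len(completion)
    return sequence, loss_mask


MODALITIES, VOCAB_SIZE = build_modalities(VOCAB_SIZES)
# Brain-to-text decoding and text-to-brain encoding use the same format.
decode_example = build_sequence(MODALITIES, "brain", [0, 127], "text", [5])
encode_prompt = build_sequence(MODALITIES, "text", [5], "brain")
```

Under these assumptions, switching direction only changes which segment comes first, so one model and one loss can cover encoding and decoding alike.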

What carries the argument

The Unified Brain Tokenizer, which quantizes continuous neural dynamics into discrete tokens aligned with visual and linguistic representations in a shared Omni space, allowing the All-in-One autoregressive next-token model to treat brain, vision, and language uniformly for any-to-any generation.
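The quantization step can be illustrated with a generic vector-quantization lookup. This is a sketch of the standard nearest-codeword technique, not the authors' tokenizer: the function names are hypothetical, and the only values taken from this page are the codebook size (128), codebook embedding dimension (32), and commitment loss weight (0.25) listed in the Table 6 excerpt under the reference graph.

```python
import math
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Replace each latent vector by its nearest codebook entry.

    Returns the chosen indices (the discrete tokens) and the codewords.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    return indices, [codebook[k] for k in indices]


def commitment_loss(latents, quantized, beta=0.25):
    """Value of the commitment term: beta times the mean squared gap.

    In training the codewords would be treated as constants (stop-gradient);
    this sketch only evaluates the number.
    """
    count = sum(len(z) for z in latents)
    total = sum(squared_distance(z, q) for z, q in zip(latents, quantized))
    return beta * total / count


def codebook_usage(indices, codebook_size):
    """Fraction of codewords used, and the perplexity of the usage histogram."""
    counts = Counter(indices)
    total = len(indices)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return len(counts) / codebook_size, math.exp(entropy)
```

Low usage or low perplexity would signal that few codewords carry the signal, which is one way to read the codebook-size ablation described in the Figure 6 caption.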

If this is right

Any-to-any generation becomes possible without separate models for each direction.
Performance exceeds prior unimodal or separate-task approaches on encoding and decoding benchmarks.
The model can be applied zero-shot to new tasks or datasets without retraining.
Biological topography of brain activity remains interpretable after training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shared token space could let researchers compare how different senses represent the same concept by inspecting token neighborhoods.
If the quantization step succeeds, similar tokenizers might be tested on other continuous signals such as muscle activity or eye tracking.
A practical next step would be to measure whether the model supports real-time closed-loop tasks where brain output directly drives image or text generation.

Load-bearing premise

Continuous neural dynamics can be quantized into discrete tokens that align with visual and linguistic representations in a shared space while preserving interpretable biological topography.

What would settle it

Demonstration that the brain-derived tokens fail to align with vision or language tokens in the shared space, or that the resulting model erases measurable biological topography of brain activity on standard mapping tasks.

Figures

Figures reproduced from arXiv: 2606.30319 by Changqing Zhang, Chi Zhang, Chunfeng Song, Haitao Wu, Jiamin Wu, Mianxin Liu, Qihao Zheng, Qirui Zhang, Shangquan Sun, Wanli Ouyang, Zhouheng Yao.

**Figure 1.** Illustration of the biological nature and the proposed modeling paradigm. (a) Biological Nature: The brain processes visual stimuli by projecting them into a unified multimodal space (Huth et al., 2016) that integrates both low-level pixel information and high-level semantic features. (b) Modeling Paradigm: Comparison between previous approaches and ours. Unlike previous task-specific, unidirectional pipel…

**Figure 2.** An overview of BrainJanus. The input data, regardless of its modality, is tokenized into a shared token space and then organized into a token sequence. BrainJanus processes these tokens autoregressively, enabling arbitrary transformations among brain, vision, and language modalities.

**Figure 3.** Qualitative comparison of brain caption decoding results. GroundTruth image captions are compared with captions decoded from fMRI voxel signals using MindEye2, UMBRAE, MindLLM, and BrainJanus (ours). Gray indicates key objects, green highlights indicate semantic matches with the GroundTruth, and red highlights denote errors. More results are shown in …

**Figure 4.** Qualitative comparison of visual decoding for Subject 1. Our method outperforms MindEye2 by generating reconstructions with higher semantic accuracy, better preservation of object and action attributes, and improved structural consistency across diverse visual categories. More examples can be found at …

**Figure 5.** Qualitative result of brain encoding. Additional examples are provided in the appendix (see …).

**Figure 6.** Ablation results of the brain tokenizer under different codebook sizes and compression ratios. We report reconstruction fidelity (MSE, SSIM), intermediate feature similarity (AlexNet2), and high-level semantic alignment (CLIP). The results reveal a clear trade-off between compression and information preservation.

**Figure 7.** Distribution comparison of CLIP Scores across three caption sources. The density plots illustrate the semantic alignment between images and captions generated by Qwen3-VL-235B (red), GIT-large (green), and the original COCO ground truth (orange). The distinct rightward shift of the Qwen distribution indicates that our generated captions achieve superior image-text alignment, surpassing both the baseline mo…

**Figure 8.** Distribution of brain activity beta values across three different trials under the same stimulus ID 3050 (Subject 1). Significant inter-trial variability can be observed.

**Figure 9.** Qualitative comparison of brain caption decoding results. GroundTruth image captions are compared with captions decoded from fMRI voxel signals using MindEye2, UMBRAE, MindLLM, and BrainJanus (ours). Green highlights indicate semantic matches with the GroundTruth, while red highlights denote errors.

**Figure 10.** Qualitative results of brain-to-image decoding.

**Figure 11.** Qualitative results of image-to-brain encoding.

**Figure 12.** Distributions of voxel consistency measured by Mean Squared Error (MSE) in (a) the training set and (b) the test set. Histograms show voxel MSE values based on three repeated trials per image. Blue (Pair-wise): MSE computed between pairs of trials for the same stimulus. Teal (To-Mean): MSE between each trial and the mean response across repetitions of the same stimulus. Vertical dashed lines indicate the …

**Figure 13.** Distributions of voxel consistency measured by Cosine Similarity.

**Figure 14.** Distributions of voxel consistency measured by Pearson Correlation Coefficient.

**Figure 15.** Qualitative result of Brain Tokenizer Pretraining.

**Figure 16.** Illustration of the proposed Padding Hacking baseline under the encoding-decoding protocol. Instead of learning a biologically meaningful mapping from images to neural responses, the encoder directly stores the ground-truth visual embedding (e.g., CLIP or VQ-VAE) by zero-padding it to match the voxel dimensionality. The decoder then performs the inverse un-padding operation to recover the embedding for im…
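The Padding Hacking baseline in the Figure 16 caption is simple enough to sketch directly. The snippet below is an illustration built only from that caption: the function names and the toy embedding are assumptions, and it is not the authors' implementation. It shows why an encode-then-decode score alone cannot tell a biologically meaningful encoder from one that merely stores the answer.

```python
def pad_encode(embedding, num_voxels):
    """'Encode' by copying the visual embedding and zero-padding to voxel size.

    Nothing about neural responses is modeled; the output only has the
    right shape to pass as a predicted voxel pattern.
    """
    if len(embedding) > num_voxels:
        raise ValueError("embedding is longer than the voxel vector")
    return list(embedding) + [0.0] * (num_voxels - len(embedding))


def unpad_decode(fake_voxels, embedding_dim):
    """'Decode' by slicing the stored embedding back out."""
    return list(fake_voxels[:embedding_dim])


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy stand-in for a CLIP or VQ-VAE embedding (hypothetical values).
embedding = [0.5, -1.0, 2.0]
fake_voxels = pad_encode(embedding, num_voxels=8)
recovered = unpad_decode(fake_voxels, embedding_dim=len(embedding))
cycle_error = mse(recovered, embedding)  # exactly zero by construction
```

Because the round trip is lossless by construction, any protocol that scores only the decoded output would rate this baseline as perfect, which is presumably why the caption contrasts it with learning a meaningful image-to-brain mapping.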

Original abstract

Modeling the bidirectional correspondence between external sensory stimuli and internal neural activity has emerged as a critical frontier in neuroscience. However, existing approaches predominantly treat brain encoding and decoding as isolated tasks, relying heavily on unimodal alignment and external priors while overlooking the brain's intrinsic nature as a multimodal integration system. To address these limitations, we propose BrainJanus, the first unified brain model that integrates brain, vision, and language within a single framework. Specifically, we introduce a Unified Brain Tokenizer to quantize continuous neural dynamics into discrete tokens aligned with visual and linguistic representations in a shared Omni space. Building on this, we utilize an All-in-One autoregressive architecture that leverages next-token prediction to enable seamless any-to-any generation, which encompasses image-to-brain and text-to-brain encoding, and brain-to-image and brain-to-text decoding. Extensive experiments demonstrate that BrainJanus achieves superior performance across diverse benchmarks. Furthermore, our framework exhibits zero-shot generalization and preserves interpretable biological topography, highlighting its potential as a general-purpose brain modeling paradigm. The code is available at https://github.com/HaitaoWuTJU/BrainJanus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BrainJanus claims a first unified brain-vision-language model with tokenization and any-to-any generation, but the abstract supplies no methods, data, or validation, leaving every claim unevaluable.

The one thing to know is that this paper's abstract announces BrainJanus as the first single framework that puts brain signals, vision, and language into one tokenizer and one autoregressive model for any-to-any tasks like brain-to-image or text-to-brain. Code is on GitHub, which is a concrete plus for anyone who might want to inspect the implementation later.

Beyond that announcement there is nothing to assess. No equations for the Unified Brain Tokenizer, no loss terms, no dataset sizes, no error bars, no ablations, and no numbers that would let a reader check whether the discretization of neural dynamics actually keeps biological topography or produces real cross-modal alignments. The stress-test concern about whether continuous signals survive quantization without distortion is therefore unanswerable from what is shown.

The work is aimed at researchers in multimodal neuroscience or brain-computer interfaces who are looking for a general-purpose any-to-any brain model. At present the abstract alone does not give enough substance for a serious referee to spend time on; the central claims cannot be checked for internal consistency or external validity. I would not bring it to a reading group or cite it until the full methods and results are available.

Referee Report

2 major / 0 minor

Summary. The paper proposes BrainJanus as the first unified model integrating brain, vision, and language. It introduces a Unified Brain Tokenizer that quantizes continuous neural dynamics into discrete tokens aligned with visual and linguistic representations in a shared Omni space, and an All-in-One autoregressive architecture using next-token prediction for any-to-any generation (image-to-brain, text-to-brain, brain-to-image, brain-to-text). The manuscript claims superior performance across diverse benchmarks, zero-shot generalization, and preservation of interpretable biological topography, with code released on GitHub.

Significance. If the tokenizer and autoregressive claims hold with supporting evidence, the work could provide a general-purpose paradigm for multimodal brain modeling that unifies encoding and decoding tasks.

major comments (2)

[Abstract] The central claims of superior performance across benchmarks and zero-shot generalization are asserted without any methods description, datasets, quantitative results, error bars, or ablation studies, preventing evaluation of whether the Unified Brain Tokenizer or All-in-One architecture actually delivers these outcomes.

[Abstract] The Unified Brain Tokenizer is described as quantizing continuous neural dynamics into discrete tokens that align in Omni space while preserving biological topography, but no equations, loss terms, architecture details, or validation experiments for topography preservation or alignment fidelity are supplied; this is load-bearing for the any-to-any generation claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address each major point below, noting that the abstract is a concise summary per standard academic conventions, with full technical details in the manuscript body.

Referee: [Abstract] The central claims of superior performance across benchmarks and zero-shot generalization are asserted without any methods description, datasets, quantitative results, error bars, or ablation studies, preventing evaluation of whether the Unified Brain Tokenizer or All-in-One architecture actually delivers these outcomes.

Authors: The abstract is designed as a high-level overview of contributions and results, as is conventional for papers in this field due to length constraints. The methods, datasets, quantitative results with error bars, ablation studies, and evaluation of the tokenizer and architecture are fully detailed in Sections 3 (Methodology), 4 (Experiments), and the supplementary material, enabling complete assessment of the claims.

Revision: no

Referee: [Abstract] The Unified Brain Tokenizer is described as quantizing continuous neural dynamics into discrete tokens that align in Omni space while preserving biological topography, but no equations, loss terms, architecture details, or validation experiments for topography preservation or alignment fidelity are supplied; this is load-bearing for the any-to-any generation claims.

Authors: The abstract summarizes the tokenizer at a high level. The equations, loss terms, architecture details, and validation experiments for topography preservation and alignment fidelity are provided in Section 3.1 (Unified Brain Tokenizer) and Section 4.2 (Interpretability Analysis), directly supporting the any-to-any generation results reported in the experiments.

Revision: no

Circularity Check

0 steps flagged

No derivation chain or equations presented; no circularity detectable

The abstract and available text introduce the Unified Brain Tokenizer and All-in-One autoregressive architecture as components but supply no equations, loss functions, parameter-fitting procedures, or self-citations that could reduce any claimed result to its inputs by construction. No predictions are framed as first-principles derivations, and no load-bearing steps match the enumerated circularity patterns. The work is therefore self-contained as an empirical modeling proposal without detectable circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all fields left empty because only the abstract is available.

Reference graph

Works this paper leans on

25 extracted references · 21 canonical work pages · 12 internal anchors

[1] Anderson, P., Fernando, B., Johnson, M., and Gould, S. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pp. 382–398. Springer, 2016.

[2] Bao, G., Zhang, Q., Gong, Z., Wu, Z., and Miao, D. Mindsimulator: Exploring brain concept localization via synthetic fmri. arXiv preprint arXiv:2503.02351.

[3] Caro, J. O., Fonseca, A. H. d. O., Averill, C., Rizvi, S. A., Rosati, M., Cross, J. L., Mittal, P., Zappala, E., Levine, D., Dhodapkar, R. M., et al. Brainlm: A foundation model for brain activity recordings. bioRxiv, pp. 2023–09, 2023.

[4] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818.

[5] Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., and Ruan, C. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811.

[6] Chen, Y., Ren, K., Song, K., Wang, Y., Wang, Y., Li, D., and Qiu, L. Eegformer: Towards transferable and interpretable large-scale eeg foundation model. arXiv preprint arXiv:2401.10278.

[7] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683.

[8] Ferrante, M., Ozcelik, F., Boccato, T., VanRullen, R., and Toschi, N. Brain captioning: Decoding human brain activity into images and text. arXiv preprint arXiv:2305.11560.

[9] Huang, R., Cho, S., Gohil, C., Jones, O. P., and Woolrich, M. Meg-gpt: A transformer-based foundation model for magnetoencephalography data. arXiv preprint arXiv:2510.18080.

[10] Jiang, W.-B., Zhao, L.-M., and Lu, B.-L. Large brain model for learning generic representations with tremendous eeg data in bci. arXiv preprint arXiv:2405.18765, 2024.

[11] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[12] Mai, W. and Zhang, Z. Unibrain: Unify image reconstruction and captioning all in one diffusion model from human brain activity. arXiv preprint arXiv:2308.07428.

[13] Scotti, P. S., Tripathy, M., Villanueva, C. K. T., Kneeland, R., Chen, T., Narang, A., Santhirasegaran, C., Xu, J., Naselaris, T., Norman, K. A., et al. Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data. arXiv preprint arXiv:2403.11207, 2024.

[14] Song, Y., Liu, B., Li, X., Shi, N., Wang, Y., and Gao, X. Decoding natural images from eeg for object recognition. arXiv preprint arXiv:2308.13234, 2023.

[15] Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., and Yuan, Z. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525.

[16] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[17] Wang, G., Liu, W., He, Y., Xu, C., Ma, L., and Li, H. Eegpt: Pretrained transformer for universal and reliable representation of eeg signals. Advances in Neural Information Processing Systems, 37:39249–39280, 2024a. Wang, J., Yang, Z., Hu, X., Li, L., Lin, K., Gan, Z., Liu, Z., Liu, C., and Wang, L. Git: A generative image-to-text transformer for visio...

[18] Wang, S., Liu, S., Tan, Z., and Wang, X. Mindbridge: A cross-subject brain decoding framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11333–11342, 2024b. Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al. Emu3: Next-token prediction is all you need. arXiv...

[19] Xiao, Q., Cui, Z., Zhang, C., Chen, S., Wu, W., Thwaites, A., Woolgar, A., Zhou, B., and Zhang, C. Brainomni: A brain foundation model for unified eeg and meg signals. arXiv preprint arXiv:2505.18185.

[20] Xie, J., Mao, W., Bai, Z., Zhang, D. J., Wang, W., Lin, K. Q., Gu, Y., Chen, Z., Yang, Z., and Shou, M. Z. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528.

[21] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[22] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.

[23] Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., and Levy, O. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024a. Zhou, Q., Du, C., Wang, S., and He, H. Clip-mused: Clip-guided multi-subject visual neural information semantic deco...
[24] Detailed Results B.1

Table 6. Hyperparameters of BrainTokenizer

| Hyperparameter | Value |
| --- | --- |
| Base channel size | 64 |
| Encoder channel multipliers | [1, 2, 2, 2, 4, 4, 4] |
| Decoder channel multipliers | [1, 2, 2, 2, 4, 4, 4] |
| Residual blocks per level | 2 |
| Downsampling factor per level | 2 |
| Latent channel dimension | 512 |
| Codebook size | 128 |
| Codebook embedding dimension | 32 |
| Commitment loss weight | 0.25 |

Entrop...
[25] ... predominantly rely on voxel-wise metrics, such as Pearson correlation and MSE. However, these local measures suffer from two fundamental limitations. First, they neglect global cortical topography, failing to penalize structurally incoherent predictions. Second, they are overly sensitive to the intrinsic stochasticity of neural responses (i.e., trial-to-t...
