pith. machine review for the scientific record.

arxiv: 2604.12537 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI


MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models


Pith reviewed 2026-05-10 14:57 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: vision-language models · positional encoding · information density · training-free method · multimodal reasoning · attention allocation · transformer models · index scaling

The pith

MODIX rescales positional indices in vision-language models using information density scores to allocate finer granularity to informative content without training or architecture changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models assign identical positional indices to every token, which spreads attention evenly even when some image patches or text segments carry far more relevant information than others. This uniform assignment leaves redundant visual regions dominating the computation while important details receive less effective focus. MODIX computes a single score for each modality by combining covariance-based entropy that measures internal information density with cross-modal alignment that captures how well one modality supports the other. These scores then stretch or compress the positional steps so high-value tokens get smaller strides and finer resolution while low-value tokens get larger strides and coarser treatment. If the approach holds, models can achieve stronger multimodal reasoning on the same architecture simply by treating positional indices as a movable resource that adapts to the input's actual information distribution.
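
To make the rescaling concrete, here is a minimal sketch of the stride mechanics as described above. The ratio used for the stride, the clip bounds, and all names are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def rescaled_positions(n_vision: int, n_text: int,
                       c_vision: float, c_text: float) -> np.ndarray:
    """Toy version of information-driven index scaling: the vision stride
    shrinks when the vision score dominates (finer resolution) and grows
    when text dominates (coarser treatment); text keeps a unit stride.
    The stride rule and clip bounds are assumptions, not the paper's."""
    stride_v = np.clip(c_text / max(c_vision, 1e-8), 0.25, 4.0)
    vision_idx = np.arange(n_vision) * stride_v              # fractional positions
    text_start = vision_idx[-1] + stride_v if n_vision else 0.0
    text_idx = text_start + np.arange(n_text)                # unit stride for text
    return np.concatenate([vision_idx, text_idx])

# a vision-heavy input: stride 0.5 packs six patches into half the index span
pos = rescaled_positions(n_vision=6, n_text=4, c_vision=2.0, c_text=1.0)
```

The point of the sketch is the resource framing: the same number of tokens occupies a shorter or longer span of positional index space depending on how informative the modality is.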

Core claim

MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions.

What carries the argument

Unified scores from covariance-based entropy for intra-modal density and cross-modal alignment for inter-modal interactions, used to rescale positional strides.
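
As a hedged illustration of what such a unified score could look like, the sketch below pairs a Gaussian log-determinant entropy proxy (intra-modal density) with a cosine term (cross-modal alignment). The equal weighting and the ridge constant are assumptions; the paper's Eq. 13 is not reproduced here.

```python
import numpy as np

def unified_score(emb: np.ndarray, other: np.ndarray) -> float:
    """Covariance-based entropy plus cross-modal alignment, combined
    additively. Illustrative stand-in, not the paper's equation."""
    x = emb - emb.mean(axis=0, keepdims=True)
    cov = x.T @ x / max(len(x) - 1, 1)
    # slogdet stays finite for near-singular covariances thanks to the ridge
    entropy = 0.5 * np.linalg.slogdet(cov + 1e-4 * np.eye(cov.shape[0]))[1]
    a, b = emb.mean(axis=0), other.mean(axis=0)
    alignment = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return entropy + alignment

rng = np.random.default_rng(0)
vision = rng.normal(size=(196, 64))   # e.g. 14x14 patch embeddings
text = rng.normal(size=(32, 64))
c_vision, c_text = unified_score(vision, text), unified_score(text, vision)
```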

If this is right

  • Consistent gains in multimodal reasoning appear across multiple VLM architectures and benchmarks.
  • Attention is reallocated in a task-dependent way according to measured information distributions.
  • No parameter updates or retraining are needed to obtain the gains.
  • Positional encoding functions as an adaptive resource rather than a fixed grid.
  • Finer granularity is given to informative modalities while redundant ones are compressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same scoring logic could be tested on long-context language models to see whether compressing low-density tokens reduces effective sequence length without loss.
  • If the density measures prove stable, the method might extend to video-language tasks where temporal redundancy is even higher.
  • One could check whether the rescaling changes the model's behavior on adversarial inputs that deliberately create misleading alignment scores.

Load-bearing premise

Covariance-based entropy and cross-modal alignment scores accurately reflect task-relevant information density, and rescaling positional indices will improve attention allocation without introducing artifacts or hurting performance on other inputs.

What would settle it

Applying MODIX to a baseline VLM on a standard multimodal reasoning benchmark and observing no gain or a performance drop relative to uniform positional indexing.
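
One way to run that decisive test: evaluate the same benchmark items under both indexings and bootstrap the paired per-item difference. The harness below is a sketch; the 0/1 correctness arrays are hypothetical inputs, not reported results.

```python
import numpy as np

def paired_bootstrap(correct_uniform, correct_modix, n_boot=10_000, seed=0):
    """Paired bootstrap over per-item 0/1 correctness. Returns the mean
    accuracy gain of MODIX over uniform indexing and a rough one-sided
    p-value: the fraction of resamples where the gain is <= 0."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(correct_modix, float) - np.asarray(correct_uniform, float)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    gains = diff[idx].mean(axis=1)
    return diff.mean(), float((gains <= 0).mean())

# hypothetical per-item outcomes from the two runs, not real results
uniform = np.array([1, 0, 1, 1, 0, 1, 0, 1])
modix   = np.array([1, 1, 1, 1, 0, 1, 1, 1])
gain, p_value = paired_bootstrap(uniform, modix)
```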

Figures

Figures reproduced from arXiv: 2604.12537 by Ruoxiang Huang, Zhen Yuan.

Figure 1
Figure 1: MODIX framework. Dual pathways analyze multimodal embeddings E to compute information contributions C̃_m, which determine adaptive vision strides while text maintains a unit stride. Adjusted indices P′ integrate directly into RoPE without parameter updates. view at source ↗
Figure 2
Figure 2: Case analysis of MODIX across four representative task types. Each panel reports the decomposed information contributions (intra-modal, inter-modal, and fused via Eq. 13) together with the resulting adaptive vision stride. MODIX assigns finer visual granularity when vision contribution dominates and coarser spacing when text dominates. When C̃_vision is small relative to C̃_text, we have Δ_vision > 1, yield… view at source ↗
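
Figure 1's caption notes that the adjusted indices P′ integrate directly into RoPE without parameter updates. The sketch below shows why that is plausible: the rotary angle θ_j · p is defined for any real-valued position, so fractional strides slot in with no architectural change. This is a generic RoPE implementation, not the authors' code.

```python
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray,
                base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to (seq_len, dim) vectors at
    arbitrary real-valued positions (half-split rotation layout)."""
    _, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)    # theta_j = base^(-2j/dim)
    ang = positions[:, None] * freqs[None, :]    # (seq_len, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    # pairwise 2-D rotations; identical math for integer or fractional positions
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# MODIX-style fractional indices plug straight in
q = np.random.default_rng(1).normal(size=(10, 8))
q_rotated = rope_rotate(q, positions=np.arange(10) * 0.5)
```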
read the original abstract

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MODIX, a training-free framework for Vision-Language Models that dynamically rescales positional indices based on multimodal information density. It computes intra-modal density via covariance-based entropy and inter-modal interactions via cross-modal alignment to derive unified scores; these scores adjust positional strides to allocate finer granularity to informative content and compress redundant regions, improving attention allocation and downstream multimodal reasoning without any changes to model parameters or architecture. Experiments are claimed to show consistent gains across architectures and benchmarks.

Significance. If the empirical improvements and adaptive reallocation hold under rigorous testing, this would represent a meaningful contribution by treating positional encoding as an adaptive, information-driven resource rather than a fixed uniform assignment. The training-free and architecture-agnostic design is a clear strength, enabling broad applicability and avoiding the computational cost of retraining. The joint modeling of intra- and inter-modal metrics offers a principled way to address inefficiencies in standard transformer positional encodings for multimodal sequences.

major comments (2)
  1. [Abstract and Experiments] The manuscript asserts consistent improvements across architectures and benchmarks but provides no quantitative results, error bars, ablation studies, baseline comparisons, or statistical significance tests in the summary description. This absence makes it impossible to verify the magnitude, reliability, or task-dependence of the claimed gains, which is load-bearing for the central empirical claim.
  2. [Method] The core assumption that covariance-based entropy and cross-modal alignment scores accurately proxy task-relevant information density (and that rescaling will improve attention without artifacts) is not validated through correlation analyses, ablations on score components, or failure-case studies. This directly underpins whether the derived unified scores reliably drive beneficial positional adjustments.
minor comments (2)
  1. [Method] Clarify the exact mathematical formulation of the unified score (e.g., how entropy and alignment are normalized and combined) with an explicit equation and pseudocode for reproducibility.
  2. [Figures] Ensure all figures illustrating the MODIX pipeline include clear labels for input modalities, score computation steps, and output positional indices.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of treating positional encoding as an adaptive resource in VLMs. We address each major comment below and will revise the manuscript to strengthen the empirical support and validation of the core assumptions.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The manuscript asserts consistent improvements across architectures and benchmarks but provides no quantitative results, error bars, ablation studies, baseline comparisons, or statistical significance tests in the summary description. This absence makes it impossible to verify the magnitude, reliability, or task-dependence of the claimed gains, which is load-bearing for the central empirical claim.

    Authors: We agree that the abstract does not contain specific quantitative metrics. In the revised manuscript, we will update the abstract to include key results such as average performance gains across benchmarks and architectures. We will also expand the Experiments section to report error bars from multiple runs, detailed ablation studies, comparisons against relevant baselines, and statistical significance tests to rigorously substantiate the claims of consistent improvements and task-dependent reallocation. revision: yes

  2. Referee: [Method] The core assumption that covariance-based entropy and cross-modal alignment scores accurately proxy task-relevant information density (and that rescaling will improve attention without artifacts) is not validated through correlation analyses, ablations on score components, or failure-case studies. This directly underpins whether the derived unified scores reliably drive beneficial positional adjustments.

    Authors: We acknowledge that direct validation of the proxy metrics is necessary. In the revision, we will add correlation analyses linking the covariance-based entropy and cross-modal alignment scores to task-relevant information density or downstream performance, component-wise ablations to isolate the contribution of each term in the unified score, and failure-case analyses to identify scenarios where rescaling may introduce artifacts or fail to improve attention allocation. These additions will clarify the reliability of the approach. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes MODIX as a training-free rescaling of positional indices derived from covariance-based intra-modal entropy and cross-modal alignment scores. No equations, derivations, or self-citations are shown in the abstract or description that reduce the claimed performance gains to a fitted parameter, self-referential definition, or load-bearing prior result by the same authors. The central claim relies on externally motivated statistical proxies for information density without visible reduction to its own inputs by construction. The derivation therefore appears self-contained, judged against the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven premise that information density can be reliably quantified by covariance entropy plus alignment and that positional rescaling will produce net gains in attention quality; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Uniform positional indexing is suboptimal for multimodal sequences with varying information density.
    This premise is stated directly in the abstract as the motivation for the work.

pith-pipeline@v0.9.0 · 5469 in / 1266 out tokens · 42852 ms · 2026-05-10T14:57:05.380669+00:00 · methodology

