
arxiv: 2510.23095 · v3 · submitted 2025-10-27 · 💻 cs.CV


Revisiting Multimodal Positional Encoding in Vision-Language Models

keywords: multimodal, encoding, position, positional, frequency, models, rope, vision-language

Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors. Respectively, these ensure unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple, plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.
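To make "frequency allocation" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard RoPE frequency schedule and three position axes (temporal, height, width). It contrasts a contiguous (chunked) assignment of frequency channels to axes with a round-robin (interleaved) one. The function names, the axis labels, and the `head_dim=16` setting are all hypothetical choices made for illustration.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def chunked_axes(num_freqs, axes=("t", "h", "w")):
    """Assign contiguous blocks of frequency channels to each axis (assumed layout)."""
    per_axis = math.ceil(num_freqs / len(axes))
    return [axes[min(i // per_axis, len(axes) - 1)] for i in range(num_freqs)]

def interleaved_axes(num_freqs, axes=("t", "h", "w")):
    """Assign frequency channels to axes round-robin (assumed layout)."""
    return [axes[i % len(axes)] for i in range(num_freqs)]

def rotation_angles(position, axis_of_channel, freqs):
    """Rotation angle per channel: the position index of the channel's axis
    multiplied by that channel's frequency."""
    return [position[a] * f for a, f in zip(axis_of_channel, freqs)]

freqs = rope_frequencies(head_dim=16)       # 8 frequency channels
chunked = chunked_axes(len(freqs))          # t t t h h h w w
interleaved = interleaved_axes(len(freqs))  # t h w t h w t h

# A text token at 1D position 5, written with identical (t, h, w) indices.
text_pos = {"t": 5, "h": 5, "w": 5}
text_angles = rotation_angles(text_pos, interleaved, freqs)
```

In this sketch the chunked layout gives each axis only one band of the spectrum, while the interleaved layout spreads high and low frequencies across all three axes. When a token's three indices are equal, as for `text_pos`, both layouts produce the same angles as ordinary 1D RoPE, which is one way to read "preservation of textual priors" under these assumptions.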

This paper has not been read by Pith yet.


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset

    cs.CV 2026-04 unverdicted novelty 7.0

    InCaRPose is a Transformer-based model trained on synthetic data that predicts absolute metric-scale relative poses between distorted in-cabin camera views and generalizes to real images while releasing a new test dataset.

  2. MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    MODIX dynamically rescales positional indices in VLMs using intra-modal covariance-based entropy and inter-modal alignment scores to allocate finer granularity to informative content.

  3. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  4. Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities

    cs.CV 2026-04 unverdicted novelty 5.0

    Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.

  5. Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers

    cs.CV 2026-05 unverdicted novelty 4.0

    Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.

  6. Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

    cs.CV 2026-04 unverdicted novelty 3.0

    Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...