pith. sign in

arxiv: 2405.17430 · v2 · pith:DNTK7LWWnew · submitted 2024-05-27 · 💻 cs.CV · cs.AI· cs.CL· cs.LG

Matryoshka Multimodal Models

classification 💻 cs.CV cs.AIcs.CLcs.LG
keywords visualtokenslargemodelsmatryoshkamultimodalnumberapproach
0
0 comments X
read the original abstract

Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning/merging methods do exist, they produce a single length output for each image and do not afford flexibility in trading off information density v.s. efficiency. Inspired by the concept of Matryoshka Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g. , adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks only need around ~9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

    cs.CV 2026-06 conditional novelty 7.0

    Reroute turns irreversible visual-token pruning into recoverable routing that reuses existing attention scores, improving grounding performance under aggressive reduction on LLaVA-1.5 and Qwen while preserving TFLOPs ...

  2. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  3. GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

    cs.CV 2026-05 unverdicted novelty 6.0

    GraSP-VL turns frozen VLM embedding length into a controllable semantic granularity interface via a learned shared prefix transform that creates a Semantic Matryoshka structure.

  4. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  5. MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

    cs.CL 2026-04 unverdicted novelty 6.0

    MIPIC trains nested Matryoshka representations via self-distilled intra-relational alignment with top-k CKA and progressive information chaining across depths, yielding competitive performance especially at extreme lo...

  6. HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.

  7. MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

    cs.CL 2026-04 unverdicted novelty 5.0

    MIPIC trains Matryoshka representations using self-distilled intra-relational alignment and progressive information chaining, yielding competitive results on STS, NLI, and classification tasks especially at low dimensions.