pith. machine review for the scientific record.

arxiv: 2510.12184 · v2 · submitted 2025-10-14 · 💻 cs.CV · cs.AI

Recognition: unknown

CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs

Authors on Pith: no claims yet
classification 💻 cs.CV · cs.AI
keywords visual · student · attention · compodistill · teacher · abilities · perception · compositional
Original abstract

Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a way to address the high computational complexity of full-scale MLLMs, making them more practical for real-world applications. In this regard, knowledge distillation (KD) has emerged as a promising approach, transferring the rich visual and linguistic knowledge of a larger model (teacher) to a smaller model (student). However, we observe that existing KD methods struggle to effectively distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks that require visual perception abilities, while maintaining the strong performance on visual question answering tasks that existing studies focus on. Furthermore, CompoDistill remains effective with a more advanced backbone, highlighting its generalizability.
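The listing gives no training objective, but the core idea, aligning the student's visual attention with the teacher's, can be illustrated as an auxiliary distillation term. Below is a minimal PyTorch sketch under stated assumptions: both models expose per-head attention probabilities for a chosen layer, the key positions of image tokens are known, and the function name, arguments, and head-averaging choice are hypothetical rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(student_attn, teacher_attn, visual_idx, eps=1e-8):
    """
    Hypothetical attention-distillation term: push the student's attention over
    the visual tokens toward the teacher's via KL(teacher || student).

    student_attn, teacher_attn: [batch, heads, query_len, key_len] attention
        probabilities from a chosen layer of each model.
    visual_idx: 1-D LongTensor of key positions corresponding to image tokens.
    """
    # Keep only the columns that attend to visual (image) tokens.
    s = student_attn[..., visual_idx]          # [B, H, Q, V]
    t = teacher_attn[..., visual_idx]

    # Average over heads so the maps stay comparable even if head counts differ
    # (one simple alignment choice; the paper may align heads differently).
    s = s.mean(dim=1)                          # [B, Q, V]
    t = t.mean(dim=1)

    # Renormalize over the visual tokens so each row is a distribution again.
    s = s / s.sum(dim=-1, keepdim=True).clamp_min(eps)
    t = t / t.sum(dim=-1, keepdim=True).clamp_min(eps)

    # KL divergence between teacher and student distributions, batch-averaged.
    return F.kl_div(s.clamp_min(eps).log(), t, reduction="batchmean")
```

In full training, such a term would presumably be added, with a weighting coefficient, to the usual distillation objective (next-token cross-entropy plus logit matching); which layers, heads, and weights CompoDistill actually uses is not specified in this listing.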

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04 · unverdicted · novelty 7.0

    Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...

  2. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

cs.CV · 2026-04 · unverdicted · novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to a Relevant Visual Information Shift during decoding; the training-free DSTP framework mitigates it across models.

  3. Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

cs.CL · 2026-04 · conditional · novelty 5.0

Widthwise pruning of LVLM language backbones, combined with supervised finetuning and hidden-state distillation, recovers over 95% of the original performance using just 5% of the data across 3B–7B models.