pith. sign in

arxiv: 2604.12767 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI

CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models

Pith reviewed 2026-05-10 14:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multimodal large language modelsvisual token reductionlayer fusiondual-stage pruningclass-adaptivevision transformerstoken pruning
0
0 comments X

The pith

CLASP adapts multi-layer vision fusion to input categories and prunes tokens in two stages to cut visual redundancy in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLASP as a plug-and-play method to lower the computational cost of visual token sequences in multimodal large language models. Existing single-layer and static pruning approaches often break down when instructions vary. CLASP instead fuses features from multiple vision layers in a way that depends on the input category, then splits the pruning into two stages that separately pick attention-important pivot tokens and add redundancy-aware completion tokens. This prompt-conditioned allocation supports more aggressive reduction while aiming to keep performance stable. If the approach holds, models could run faster on a broader set of tasks and architectures without custom retraining.

Core claim

CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction.

What carries the argument

Class-adaptive layer fusion paired with dual-stage pruning that first builds category-specific representations and then splits the token budget between relevance-focused pivot tokens and coverage-focused completion tokens.

If this is right

  • CLASP achieves higher accuracy than prior token reduction techniques at the same pruning ratios across multiple benchmarks.
  • The method maintains performance when applied to different multimodal large language model architectures.
  • Prompt-conditioned adaptation improves robustness compared with fixed fusion and pruning strategies.
  • Token budgets can be split to balance relevance and coverage without manual per-task tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same category-driven fusion idea could extend to pruning other sequence data such as text or audio tokens inside large models.
  • Widespread adoption might enable running high-capacity multimodal models on hardware with tighter memory or latency limits.
  • The dual-stage split suggests a general template for separating importance sampling from diversity sampling in other compression pipelines.

Load-bearing premise

Category-specific multi-layer fusion combined with the two-stage split will preserve enough critical visual information across arbitrary instructions and inputs.

What would settle it

A benchmark result in which CLASP-pruned models produce lower accuracy than static single-layer baselines on prompts that require fine details from specific vision layers or low-attention tokens.

Figures

Figures reproduced from arXiv: 2604.12767 by Qi Fan, Wenbin Li, Yang Gao, Yifan Jiang, Yinghuan Shi, Yizhu Jiang, Yunkai Dang.

Figure 1
Figure 1. Figure 1: Impact of hyperparameter settings on MMVet dataset per￾formance (LLaVA-v1.5-7B, 192 retained tokens). The heatmaps illustrate the score distribution across different question categories under varying conditions. Top: Evaluation of five representative layer-fusion strategies (A–E), ordered by an increasing proportion of weights assigned to deeper layers (i.e., shifting from shallow in A to deep in E). Botto… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our method. The framework utilizes a prompt-to-class router to condition visual processing on textual intent. (i) Class-Adaptive Layer Fusion: ViT features are aggregated from multiple layers using class-specific mixture weights to capture an appropriate level of visual abstraction. (ii) Class-Adaptive Pruning: The projected visual token budget R is dynamically split (ratio a) between attention… view at source ↗
Figure 3
Figure 3. Figure 3: Example visualization of the original image and the corresponding token-retention map. mation for visual understanding, ensuring robust alignment for downstream tasks. 6. Conclusion In this paper, we proposed CLASP, a framework synergiz￾ing class-adaptive layer fusion with dual-stage pruning to balance token relevance and spatial coverage. Our results demonstrate that dynamic reduction minimizes redundancy… view at source ↗
Figure 4
Figure 4. Figure 4: Per-benchmark performance under increasing token pruning. We plot performance as a function of pruning ratio on eight evaluation suites, comparing our method with representative pruning baselines (SparseVLM and PDrop). The dashed line denotes the unpruned model performance. Our method exhibits the slowest degradation and stays consistently closest to the unpruned upper bound, particularly at aggressive pru… view at source ↗
Figure 5
Figure 5. Figure 5: Layer mixture ablation under a fixed token budget (R = 192) on MME, TextVQA and SQA. Rows are layer mixture strategies with weights in parentheses. Columns are question types: C0 object identification, C1 attribute or breed identification, C2 text or symbol recognition, C3 scene understanding, C4 spatial relations, C5 counting, C6 action or interaction, C7 intention or function, C8 default. 34 [PITH_FULL_… view at source ↗
Figure 6
Figure 6. Figure 6: Ablation on attention and similarity mixture weight under a fixed token budget (R = 192) on MME, TextVQA and SQA. Rows are the mixture ratios. Columns are question types: C0 object identification, C1 attribute or breed identification, C2 text or symbol recognition, C3 scene understanding, C4 spatial relations, C5 counting, C6 action or interaction, C7 intention or function, C8 default. 35 [PITH_FULL_IMAGE… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparisons on Cases 1–4. Each case visualizes the original image and the corresponding token-retention map results under three pruning ratios (R = 66.7%, 77.8%, 88.9%) at layers 2, 6, and 15. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparisons on Cases 5–8. Each case visualizes the original image and the corresponding token-retention map results under three pruning ratios (R = 66.7%, 77.8%, 88.9%) at layers 2, 6, and 15. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_8.png] view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at https://github.com/Yunkaidang/CLASP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes CLASP, a plug-and-play token reduction framework for Multimodal Large Language Models (MLLMs) that addresses high redundancy in visual token sequences. It introduces class-adaptive multi-layer fusion of Vision Transformer features to build category-specific representations, followed by dual-stage pruning that allocates a token budget between attention-salient pivot tokens (for relevance) and redundancy-aware completion tokens (for coverage). The approach enables prompt-conditioned fusion and budget allocation. Extensive experiments are reported to demonstrate consistent outperformance over prior methods across benchmarks, pruning ratios, and multiple MLLM architectures, with code to be released.

Significance. If the experimental results hold, CLASP represents a practical advance in efficient MLLM inference by mitigating the brittleness of single-layer and static pruning strategies under diverse instructions. The class-adaptive mechanism and dual-stage design provide a flexible, architecture-agnostic solution for aggressive visual token reduction without critical performance loss. Reproducibility is strengthened by the promised code release and the reported ablations and cross-architecture evaluations.

minor comments (3)
  1. [§3.2] §3.2: The selection criteria for pivot tokens (attention salience) and completion tokens (redundancy awareness) are described in prose but would benefit from an explicit equation or pseudocode to clarify the budget allocation step.
  2. [Table 3] Table 3 and Figure 4: Standard deviations or error bars are not shown for the reported accuracy and efficiency metrics; including them would better support the claim of consistent outperformance across pruning ratios.
  3. [§4.3] §4.3: The ablation study on layer fusion depth is informative but lacks a direct comparison to a non-class-adaptive baseline within the same table, which would isolate the contribution of the adaptive component.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of CLASP, recognition of its practical contributions to efficient MLLM inference, and recommendation for minor revision. The referee's assessment aligns well with the manuscript's claims regarding class-adaptive fusion and dual-stage pruning.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces CLASP as a plug-and-play framework combining class-adaptive multi-layer vision feature fusion with dual-stage pruning (pivot and completion tokens). No equations, derivations, or first-principles results are shown that reduce claimed performance gains to quantities defined by the method's own fitted parameters, self-referential normalizations, or self-citation chains. The central claims rest on experimental benchmarks across pruning ratios and architectures rather than internal definitions or imported uniqueness theorems. The method description uses standard components without smuggling ansatzes via prior self-citations or renaming known empirical patterns as novel unifications. This is a standard empirical method paper whose results are externally falsifiable via the promised code and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities; the framework is described as a plug-and-play extension of existing ViT feature extraction and pruning ideas without new postulated quantities.

pith-pipeline@v0.9.0 · 5497 in / 1174 out tokens · 52003 ms · 2026-05-10T14:51:43.277882+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

  1. [1]

    Routing and mixture weights.Given the prompt xn, a text-only router predicts a class cn, which selects a row wcn ∈R L from the layer-score matrixW. We then convert it into a probability distribution over layers: cn ←Route(x n),α n ≜softmax(τw cn)∈∆ L−1.(25) Hereτ >0is a temperature that controls how “peaky” the layer preference is

  2. [2]

    out-of-manifold

    Token-wise convex fusion across layers.For each token index t∈ V n, we fuse its layer-wise representations by a weighted sum. This dynamically integrates features from varying levels of visual abstraction: ¯zn,t = LX l=1 αn,lz(l) n,t.(26) 3.Projection into the decoder space.The fused token is mapped to the decoder embedding space throughf proj: ˜zn,t =f p...

  3. [3]

    relevance-first

    Budget split into relevance pivots vs. diversity completion.Given a class-dependent ratio an ∈[0,1] and total budgetR, determining the specific allocation size for the relevance and diversity stages: K1 =⌊a nR⌋, K 2 =R−K 1.(34) 3.Stage I: pivots by top-K 1 saliency.We keep theK 1 most salient tokens as pivots: Pn = TopK1 {ϕn,t}t∈Vn .(35) This stage is “re...

  4. [4]

    Using unit features un,t (Eq

    Pivot-relative redundancy.Let Un =V n \ Pn be the non-pivot pool. Using unit features un,t (Eq. (24)), define redundancy of a candidate tokent∈ U n as ρn,t ≜max j∈Pn u⊤ n,tun,j, t∈ U n.(36) Because un,t are unit vectors, ρn,t ∈[−1,1] is cosine similarity. Large ρn,t means t is highly duplicated by some pivot; smallρ n,t meanstlies in a direction poorly co...

  5. [5]

    To make our seeding rule explicit, for any seed setC ⊆ U n with|C|=K 2, we define the total pivot-relative redundancy as D(C | P n)≜ X t∈C ρn,t.(37) Since Eq

    Deterministic redundancy-aware seeding for Stage II.We view ρn,t as apivot-overlap cost: it measures how much a candidate token t∈ U n resembles the already-selected pivot set Pn (worst-case cosine overlap). To make our seeding rule explicit, for any seed setC ⊆ U n with|C|=K 2, we define the total pivot-relative redundancy as D(C | P n)≜ X t∈C ρn,t.(37) ...

  6. [6]

    Bottom-K2 seeding (optimal for D).The minimizer of Eq. (37) is obtained by selecting the K2 least redundant tokens, since the sum is minimized by the smallest individual terms: C(0) n = BottomK2 {ρn,t}t∈Un .(38) We use the corresponding unit features as initial sphericalK-means centers,i.e., setµ (0) n,k =u n,ck forc k ∈ C (0) n . 22 CLASP: Class-Adaptive...

  7. [7]

    (26)).Computing {¯zn,t}t∈Vn costs O(L Mn dv): a single weighted sum over cached layer outputs

    Layer fusion (Eq. (26)).Computing {¯zn,t}t∈Vn costs O(L Mn dv): a single weighted sum over cached layer outputs. 2.Saliency top-K 1 (Eq.(35)).SelectingTop K1 costsO(M n logK 1)via partial sort / heap

  8. [8]

    (36)).Naively, computing ρn,t = max j∈Pn u⊤ n,tun,j for all t∈ U n costs O((Mn −K 1)K1d)

    Redundancy computation for seeding (Eq. (36)).Naively, computing ρn,t = max j∈Pn u⊤ n,tun,j for all t∈ U n costs O((Mn −K 1)K1d). This can be implemented as a matrix multiplication between UU ∈R (Mn−K1)×d and UP ∈R K1×d followed by a row-wise max, yielding the required redundancy values. 4.Bottom-K 2 seeding (Eq.(38)).SelectingBottom K2 costsO((M n −K 1) logK 2)

  9. [9]

    (39)–(40)).Each iteration costs O((Mn −K 1)K2d) for similarity evalu- ation/assignment plus O((Mn −K 1)d) for accumulating cluster sums and normalization

    Spherical K-means refinement (Eqs. (39)–(40)).Each iteration costs O((Mn −K 1)K2d) for similarity evalu- ation/assignment plus O((Mn −K 1)d) for accumulating cluster sums and normalization. Over T iterations, the refinement cost isO(T(M n −K 1)K2d), which remains efficient for small iteration counts. 6.Medoid selection (Eq.(41)).Computing similarities of ...