CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models
Pith reviewed 2026-05-10 14:51 UTC · model grok-4.3
The pith
CLASP adapts multi-layer vision fusion to input categories and prunes tokens in two stages to cut visual redundancy in multimodal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction.
What carries the argument
Class-adaptive layer fusion paired with dual-stage pruning that first builds category-specific representations and then splits the token budget between relevance-focused pivot tokens and coverage-focused completion tokens.
If this is right
- CLASP achieves higher accuracy than prior token reduction techniques at the same pruning ratios across multiple benchmarks.
- The method maintains performance when applied to different multimodal large language model architectures.
- Prompt-conditioned adaptation improves robustness compared with fixed fusion and pruning strategies.
- Token budgets can be split to balance relevance and coverage without manual per-task tuning.
Where Pith is reading between the lines
- The same category-driven fusion idea could extend to pruning other sequence data such as text or audio tokens inside large models.
- Widespread adoption might enable running high-capacity multimodal models on hardware with tighter memory or latency limits.
- The dual-stage split suggests a general template for separating importance sampling from diversity sampling in other compression pipelines.
Load-bearing premise
Category-specific multi-layer fusion combined with the two-stage split will preserve enough critical visual information across arbitrary instructions and inputs.
What would settle it
A benchmark result in which CLASP-pruned models produce lower accuracy than static single-layer baselines on prompts that require fine details from specific vision layers or low-attention tokens.
Original abstract
Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at https://github.com/Yunkaidang/CLASP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CLASP, a plug-and-play token reduction framework for Multimodal Large Language Models (MLLMs) that addresses high redundancy in visual token sequences. It introduces class-adaptive multi-layer fusion of Vision Transformer features to build category-specific representations, followed by dual-stage pruning that allocates a token budget between attention-salient pivot tokens (for relevance) and redundancy-aware completion tokens (for coverage). The approach enables prompt-conditioned fusion and budget allocation. Extensive experiments are reported to demonstrate consistent outperformance over prior methods across benchmarks, pruning ratios, and multiple MLLM architectures, with code to be released.
Significance. If the experimental results hold, CLASP represents a practical advance in efficient MLLM inference by mitigating the brittleness of single-layer and static pruning strategies under diverse instructions. The class-adaptive mechanism and dual-stage design provide a flexible, architecture-agnostic solution for aggressive visual token reduction without critical performance loss. Reproducibility is strengthened by the promised code release and the reported ablations and cross-architecture evaluations.
minor comments (3)
- §3.2: The selection criteria for pivot tokens (attention salience) and completion tokens (redundancy awareness) are described in prose but would benefit from an explicit equation or pseudocode to clarify the budget allocation step.
- Table 3 and Figure 4: Standard deviations or error bars are not shown for the reported accuracy and efficiency metrics; including them would better support the claim of consistent outperformance across pruning ratios.
- §4.3: The ablation study on layer fusion depth is informative but lacks a direct comparison to a non-class-adaptive baseline within the same table, which would isolate the contribution of the adaptive component.
Simulated Author's Rebuttal
We thank the referee for their positive summary of CLASP, recognition of its practical contributions to efficient MLLM inference, and recommendation for minor revision. The referee's assessment aligns well with the manuscript's claims regarding class-adaptive fusion and dual-stage pruning.
Circularity Check
No significant circularity detected
The paper introduces CLASP as a plug-and-play framework combining class-adaptive multi-layer vision feature fusion with dual-stage pruning (pivot and completion tokens). No equations, derivations, or first-principles results are shown that reduce claimed performance gains to quantities defined by the method's own fitted parameters, self-referential normalizations, or self-citation chains. The central claims rest on experimental benchmarks across pruning ratios and architectures rather than internal definitions or imported uniqueness theorems. The method description uses standard components without smuggling ansatzes via prior self-citations or renaming known empirical patterns as novel unifications. This is a standard empirical method paper whose results are externally falsifiable via the promised code and benchmarks.
Reference graph
Works this paper leans on
-
[1]
Routing and mixture weights. Given the prompt $x_n$, a text-only router predicts a class $c_n$, which selects a row $w_{c_n} \in \mathbb{R}^{L}$ from the layer-score matrix $W$. We then convert it into a probability distribution over layers: $c_n \leftarrow \mathrm{Route}(x_n)$, $\alpha_n \triangleq \mathrm{softmax}(\tau w_{c_n}) \in \Delta^{L-1}$. (25) Here $\tau > 0$ is a temperature that controls how “peaky” the layer preference is.
-
[2]
Token-wise convex fusion across layers. For each token index $t \in \mathcal{V}_n$, we fuse its layer-wise representations by a weighted sum. This dynamically integrates features from varying levels of visual abstraction: $\bar{z}_{n,t} = \sum_{l=1}^{L} \alpha_{n,l}\, z^{(l)}_{n,t}$. (26)
3. Projection into the decoder space. The fused token is mapped to the decoder embedding space through $f_{\mathrm{proj}}$: $\tilde{z}_{n,t} =$ f p...
-
[3]
Budget split into relevance pivots vs. diversity completion. Given a class-dependent ratio $a_n \in [0,1]$ and total budget $R$, we determine the allocation sizes for the relevance and diversity stages: $K_1 = \lfloor a_n R \rfloor$, $K_2 = R - K_1$. (34)
3. Stage I: pivots by top-$K_1$ saliency. We keep the $K_1$ most salient tokens as pivots: $\mathcal{P}_n = \mathrm{Top}_{K_1}\{\phi_{n,t}\}_{t \in \mathcal{V}_n}$. (35) This stage is “re...
-
[4]
Pivot-relative redundancy. Let $\mathcal{U}_n = \mathcal{V}_n \setminus \mathcal{P}_n$ be the non-pivot pool. Using unit features $u_{n,t}$ (Eq. (24)), define the redundancy of a candidate token $t \in \mathcal{U}_n$ as $\rho_{n,t} \triangleq \max_{j \in \mathcal{P}_n} u_{n,t}^{\top} u_{n,j}$, $t \in \mathcal{U}_n$. (36) Because $u_{n,t}$ are unit vectors, $\rho_{n,t} \in [-1,1]$ is cosine similarity. Large $\rho_{n,t}$ means $t$ is highly duplicated by some pivot; small $\rho_{n,t}$ means $t$ lies in a direction poorly co...
-
[5]
Deterministic redundancy-aware seeding for Stage II. We view $\rho_{n,t}$ as a pivot-overlap cost: it measures how much a candidate token $t \in \mathcal{U}_n$ resembles the already-selected pivot set $\mathcal{P}_n$ (worst-case cosine overlap). To make our seeding rule explicit, for any seed set $\mathcal{C} \subseteq \mathcal{U}_n$ with $|\mathcal{C}| = K_2$, we define the total pivot-relative redundancy as $D(\mathcal{C} \mid \mathcal{P}_n) \triangleq \sum_{t \in \mathcal{C}} \rho_{n,t}$. (37) ...
-
[6]
Bottom-$K_2$ seeding (optimal for $D$). The minimizer of Eq. (37) is obtained by selecting the $K_2$ least redundant tokens, since the sum is minimized by the smallest individual terms: $\mathcal{C}^{(0)}_n = \mathrm{Bottom}_{K_2}\{\rho_{n,t}\}_{t \in \mathcal{U}_n}$. (38) We use the corresponding unit features as initial spherical $K$-means centers, i.e., set $\mu^{(0)}_{n,k} = u_{n,c_k}$ for $c_k \in \mathcal{C}^{(0)}_n$. ...
-
[7]
Layer fusion (Eq. (26)). Computing $\{\bar{z}_{n,t}\}_{t \in \mathcal{V}_n}$ costs $O(L M_n d_v)$: a single weighted sum over cached layer outputs.
2. Saliency top-$K_1$ (Eq. (35)). Selecting $\mathrm{Top}_{K_1}$ costs $O(M_n \log K_1)$ via partial sort / heap.
-
[8]
Redundancy computation for seeding (Eq. (36)). Naively, computing $\rho_{n,t} = \max_{j \in \mathcal{P}_n} u_{n,t}^{\top} u_{n,j}$ for all $t \in \mathcal{U}_n$ costs $O((M_n - K_1) K_1 d)$. This can be implemented as a matrix multiplication between $U_{\mathcal{U}} \in \mathbb{R}^{(M_n - K_1) \times d}$ and $U_{\mathcal{P}} \in \mathbb{R}^{K_1 \times d}$ followed by a row-wise max, yielding the required redundancy values.
4. Bottom-$K_2$ seeding (Eq. (38)). Selecting $\mathrm{Bottom}_{K_2}$ costs $O((M_n - K_1) \log K_2)$.
-
[9]
Spherical $K$-means refinement (Eqs. (39)–(40)). Each iteration costs $O((M_n - K_1) K_2 d)$ for similarity evaluation/assignment plus $O((M_n - K_1) d)$ for accumulating cluster sums and normalization. Over $T$ iterations, the refinement cost is $O(T (M_n - K_1) K_2 d)$, which remains efficient for small iteration counts.
6. Medoid selection (Eq. (41)). Computing similarities of ...
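The excerpted steps for Eqs. (25), (26), (34), (35), (36) and (38) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name is hypothetical, the router output and the saliency scores are taken as given inputs because their definitions are not reproduced above, and unit features are assumed to be already L2-normalized. The spherical K-means refinement and medoid selection (Eqs. (39)–(41)) are left out because those equations do not appear in the excerpts.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float], tau: float) -> List[float]:
    """Eq. (25): alpha = softmax(tau * w), with the usual max-shift for stability."""
    scaled = [tau * s for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_feats: Sequence[Sequence[Vector]],
                alpha: Sequence[float]) -> List[Vector]:
    """Eq. (26): per-token convex combination; layer_feats[l][t] is a vector."""
    num_tokens, dim = len(layer_feats[0]), len(layer_feats[0][0])
    return [
        [sum(a * feats[t][d] for a, feats in zip(alpha, layer_feats))
         for d in range(dim)]
        for t in range(num_tokens)
    ]


def split_budget(ratio: float, budget: int) -> Tuple[int, int]:
    """Eq. (34): K1 = floor(a_n * R), K2 = R - K1."""
    k1 = math.floor(ratio * budget)
    return k1, budget - k1


def select_pivots(saliency: Sequence[float], k1: int) -> List[int]:
    """Eq. (35): indices of the K1 most salient tokens (scores are an input here)."""
    ranked = sorted(range(len(saliency)), key=lambda t: saliency[t], reverse=True)
    return sorted(ranked[:k1])


def pivot_redundancy(units: Sequence[Vector], pivots: Sequence[int]) -> Dict[int, float]:
    """Eq. (36): max cosine similarity of each non-pivot token to any pivot.

    Assumes `units` are unit vectors and that at least one pivot exists.
    """
    if not pivots:
        raise ValueError("pivot set must be non-empty")
    chosen = set(pivots)
    return {
        t: max(sum(a * b for a, b in zip(units[t], units[j])) for j in pivots)
        for t in range(len(units)) if t not in chosen
    }


def bottom_k_seeds(rho: Dict[int, float], k2: int) -> List[int]:
    """Eq. (38): the K2 least redundant non-pivot tokens, used as initial seeds."""
    ranked = sorted(rho, key=lambda t: rho[t])
    return sorted(ranked[:k2])


if __name__ == "__main__":
    # Toy input: 4 tokens, budget R = 3, class-dependent ratio a_n = 0.5.
    units = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    saliency = [0.9, 0.1, 0.5, 0.2]
    k1, k2 = split_budget(0.5, 3)
    pivots = select_pivots(saliency, k1)
    seeds = bottom_k_seeds(pivot_redundancy(units, pivots), k2)
    print("pivots:", pivots, "seeds:", seeds)
```

In the toy input, token 1 duplicates the pivot direction and is therefore the one left out of the seed set, which is the behaviour the bottom-$K_2$ rule is meant to produce. A vectorized version would replace the nested loops in `pivot_redundancy` with the matrix product and row-wise max described in the complexity notes.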