From Data to Model: A Survey of the Compression Lifecycle in MLLMs
2 Pith papers cite this work.
Citing papers by year: 2026 (2)
Citing papers
- Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy
  A GEMM-centric taxonomy and unified benchmark show static depth pruning to be the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth pruning and then static width pruning as quality loss rises.
- miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity
  miniReranker reduces multimodal reranking runtime to under 1% of the dense baseline under high-reuse conditions while retaining over 96% of performance via vision-first prompting, early exit, sparse cross-segment attention, and embedder-guided token pruning.