pith. machine review for the scientific record.

arxiv: 2605.05899 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 16:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords VL-MoE · model offloading · token pruning · mixture of experts · vision-language models · inference optimization · expert locality · multimodal deployment

The pith

Pruning redundant visual tokens makes expert accesses in VL-MoE models more concentrated and stable, enabling up to 2.68x faster inference under memory limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that current MoE offloading methods, built for text, lose effectiveness on visual-heavy inputs because many visual tokens trigger wide and erratic expert use. VisMMoE shows that removing redundant visual tokens induces visual-expert affinity, concentrating accesses within layers and stabilizing them across layers to create a smaller, more predictable expert set. This affinity then supports targeted token compression, lookahead prediction, and orchestration that improve locality and prefetching. A reader would care because large vision-language models become far more practical on devices with tight memory budgets while accuracy stays competitive.
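
The review does not describe how the Affinity-aware Token Compressor decides which visual tokens are redundant, so the sketch below is only a generic stand-in: greedy cosine-similarity deduplication of visual token embeddings, a common redundancy criterion in the token-pruning literature. The function name, the keep_ratio parameter, and the greedy selection rule are illustrative assumptions, not the authors' method.

```python
import numpy as np

def prune_redundant_visual_tokens(tokens: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Generic similarity-based pruning sketch (NOT the paper's compressor).

    tokens: (N, d) array of visual token embeddings.
    Returns the indices of retained tokens; tokens most similar to the
    already-kept set are dropped first, so the kept set stays diverse.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    kept = [0]                                   # seed with the first token
    for _ in range(n_keep - 1):
        sims = normed @ normed[kept].T           # similarity to the kept set
        redundancy = sims.max(axis=1)            # closeness to anything already kept
        redundancy[kept] = np.inf                # never re-select kept tokens
        kept.append(int(np.argmin(redundancy)))  # keep the most novel token
    return np.array(sorted(kept))

# toy usage: 16 random "visual tokens" of dimension 8, keep half
rng = np.random.default_rng(0)
print(prune_redundant_visual_tokens(rng.normal(size=(16, 8)), keep_ratio=0.5))
```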

Core claim

VisMMoE establishes that pruning redundant visual tokens produces visual-expert affinity by making expert accesses more concentrated within layers and more stable across layers, yielding a smaller and more predictable expert working set. Guided by this, the system combines affinity-aware token compression, lookahead expert prediction, and cache/pipeline orchestration to raise expert locality and prefetch success under tight memory. Evaluations on multiple frameworks and representative VL-MoE models show end-to-end inference gains of up to 2.68x and 1.61x over strong baselines while accuracy remains competitive.
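
To make the cache/pipeline orchestration component concrete, the sketch below shows a minimal GPU-resident expert cache that consumes an ordered list of predicted experts as prefetch candidates before they are needed. The class name, the LRU eviction policy, and the fixed capacity are assumptions for illustration; the review does not specify VisMMoE's orchestrator at this level of detail.

```python
from collections import OrderedDict

class ExpertCache:
    """Minimal GPU expert cache sketch (illustrative, not VisMMOE's implementation).

    Experts are keyed by (layer, expert_id). A miss stands in for an on-demand
    load from host or SSD memory; prefetch() warms entries ahead of time from
    an ordered priority list such as a predictor's top-B candidates.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()          # LRU order: oldest entry first
        self.hits = self.misses = 0

    def _admit(self, key):
        if key in self.resident:
            self.resident.move_to_end(key)     # refresh recency
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least-recently-used expert
        self.resident[key] = True              # stands in for the loaded weights

    def prefetch(self, layer, candidate_experts):
        for expert_id in candidate_experts:    # ordered, most likely first
            self._admit((layer, expert_id))

    def access(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self.resident:
            self.hits += 1
            self.resident.move_to_end(key)
        else:
            self.misses += 1                   # on-demand load on the critical path
            self._admit(key)

# usage: prefetch the next layer's predicted hot experts, then route tokens
cache = ExpertCache(capacity=8)
cache.prefetch(layer=5, candidate_experts=[3, 17, 42])
cache.access(layer=5, expert_id=17)            # hit if the prediction was right
print(cache.hits, cache.misses)
```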

What carries the argument

visual-expert affinity: the effect in which pruning visual tokens concentrates expert accesses within layers and stabilizes them across layers to shrink the working set.
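
A minimal sketch of how the two halves of that definition could be measured from a routing trace: within-layer concentration as the share of routing events captured by the k most-used experts, and cross-layer stability as the Jaccard overlap between consecutive layers' active-expert sets. The trace format and both metric choices are assumptions; the paper may define its affinity metrics differently.

```python
from collections import Counter

def within_layer_concentration(layer_trace, k=8):
    """Share of routing events captured by the k most-used experts in one layer."""
    counts = Counter(layer_trace)
    return sum(c for _, c in counts.most_common(k)) / sum(counts.values())

def cross_layer_stability(trace_by_layer):
    """Mean Jaccard overlap between the active-expert sets of consecutive layers."""
    active = [set(t) for t in trace_by_layer]
    overlaps = [len(a & b) / len(a | b) for a, b in zip(active, active[1:])]
    return sum(overlaps) / len(overlaps)

# toy example: each layer's trace is the list of expert ids that tokens were routed to
trace_by_layer = [
    [1, 2, 2, 3, 1, 2],   # layer 0
    [2, 3, 2, 1, 2, 2],   # layer 1
    [2, 2, 4, 2, 3, 2],   # layer 2
    [3, 2, 2, 2, 1, 2],   # layer 3
]
print([round(within_layer_concentration(t, k=2), 2) for t in trace_by_layer])
print(round(cross_layer_stability(trace_by_layer), 2))
```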

If this is right

  • End-to-end inference speeds improve by up to 2.68x over strong baselines on current VL-MoE deployments.
  • A second reported gain reaches 1.61x on additional workloads while accuracy stays competitive.
  • Expert locality and prefetch effectiveness both increase when memory budgets are tight.
  • The techniques apply across multiple implementation frameworks and standard VL-MoE models and benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same pruning-driven affinity might extend to other multimodal MoE settings where one modality dominates token volume.
  • Hardware schedulers could adopt similar lookahead mechanisms if token pruning is applied upstream of the model.
  • Energy use on edge devices may drop further if the reduced expert working set lowers both compute and data movement.
  • Model designers might later embed pruning rules that preserve accuracy while deliberately maximizing this affinity.

Load-bearing premise

Pruning redundant visual tokens will reliably reshape expert demand into a smaller and more stable working set that compression, prediction, and orchestration can then exploit.

What would settle it

Direct measurement of expert activation patterns on a VL-MoE model before and after pruning: if the number and variability of accessed experts per layer show no meaningful drop, the load-bearing premise fails.
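
A minimal sketch of that before-and-after measurement, assuming per-layer routing traces can be logged for the same prompts with and without pruning; the trace format and summary statistics are illustrative.

```python
import statistics

def working_set_profile(traces_by_layer):
    """Number of distinct experts accessed in each layer of one run."""
    return [len(set(trace)) for trace in traces_by_layer]

def compare_runs(raw_run, pruned_run):
    """Report the size and spread of the per-layer expert working set for both runs."""
    for name, run in (("raw", raw_run), ("pruned", pruned_run)):
        sizes = working_set_profile(run)
        print(f"{name}: mean experts/layer = {statistics.mean(sizes):.1f}, "
              f"stdev across layers = {statistics.pstdev(sizes):.1f}")

# toy traces: expert ids routed per layer; pruning should shrink and steady the set
raw_run    = [[i % 32 for i in range(64)], [i % 48 for i in range(64)], [i % 24 for i in range(64)]]
pruned_run = [[i % 6 for i in range(32)], [i % 8 for i in range(32)], [i % 6 for i in range(32)]]
compare_runs(raw_run, pruned_run)
```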

Figures

Figures reproduced from arXiv: 2605.05899 by Chao Li, Cheng Xu, Jiacheng Liu, Xiaofeng Hou.

Figure 1
Figure 1: Expert activation distribution of Qwen3-VL-30B-A3B with 128 experts per layer. The x-axis is the expert ID, the y-axis is the layer ID, and darker color indicates a higher activation ratio. Text tokens exhibit strong locality, whereas raw visual tokens induce fragmented expert activations. After token pruning, the activation pattern becomes substantially more concentrated (+32% relative). view at source ↗
Figure 2
Figure 2: Why prior MoE offloading assumptions break for VL-MoEs: text-centric workloads induce concentrated expert access, whereas visual-heavy VL-MoE inputs broaden the effective expert working set. view at source ↗
Figure 3
Figure 3: Spatial impact of token pruning. (a) Pruning redundant visual tokens increases expert-access concentration, improves top-k expert coverage and thus reduces prefill time. (b) This concentration translates into a smaller and more reusable expert working set: the number of inactive experts nearly doubles, on-demand expert loads decrease, and GPU expert hit rate improves during prefill. view at source ↗
Figure 4
Figure 4: Temporal impact of token pruning. Pruning redundant visual tokens increases inter-layer routing similarity, making future expert demand more predictable for lookahead-based prefetching. view at source ↗
Figure 5
Figure 5: Overview of the VisMMOE architecture. In prefill, raw multimodal inputs are compressed by the Affinity-aware Token Compressor. The Compression-guided Lookahead Predictor produces phase-dependent priority signals: broad-demand ordering in prefill and hot-expert prediction in decode. Guided by these signals, the Expert Caching and Pipeline Orchestrator manages cache residency and overlaps expert transfer… view at source ↗
Figure 6
Figure 6: Architecture of the compression-guided lookahead predictor. The predictor pools compressed hidden-state and visual summaries from the retained token set, combines them with routing history, and predicts a priority score over experts within a bounded lookahead window. view at source ↗
Figure 7
Figure 7: Data layout of VisMMOE. view at source ↗
Figure 8
Figure 8: Comparison of end-to-end inference speed for VisMMOE and the SOTA approaches. VisMMOE achieves the best performance on all tasks and hardware platforms in both prefill and decode stages. view at source ↗
Figure 9
Figure 9: VisMMOE inference speed on Orin. Currently, sglang [52] and ktransformers [4] fail to support SSD swap on Orin with 32GB unified memory, suffering severe NvMapMemAllocInternalTagged errors in the weight-loading stage. view at source ↗
Figure 13
Figure 13: VisMMOE compression and predictor costs, presented as latency distributions. view at source ↗
Figure 10
Figure 10: Layer-wise Hot Recall performance of the expert predictor on different datasets. The proposed predictor significantly outperforms the random baselines by 3.82x (predict-30) to 7.9x (predict-10). Note that the VisMMOE predictor works well even for textvqa, which does not participate in training at all. Layers 0-9 are pinned on the GPU memory and do not require prediction. view at source ↗
Figure 11
Figure 11: The effect of visual info in prediction. view at source ↗
Original abstract

Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed for text-centric workloads and become much less effective for visual-heavy inputs, where large numbers of visual tokens induce broader and less predictable expert accesses. We present VisMMoE, a VL-MoE offloading system built on a single systems insight: pruning redundant visual tokens can improve offloading not only by reducing computation, but also by reshaping expert demand. We refer to this effect as visual-expert affinity: token pruning makes expert accesses more concentrated within layers and more stable across layers, producing a smaller and more predictable expert working set. Guided by this insight, VisMMoE combines affinity-aware token compression, lookahead expert prediction, and cache/pipeline orchestration to improve expert locality and prefetch effectiveness under tight memory budgets. We implement VisMMoE on multiple frameworks and evaluate it on representative VL-MoE models and benchmarks. VisMMoE improves end-to-end inference performance by up to 2.68x and 1.61x, respectively, over strong baselines for today's VL-MoE deployments while maintaining competitive accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents VisMMoE, a VL-MoE offloading system based on the observation that pruning redundant visual tokens induces 'visual-expert affinity'—making expert accesses more concentrated within layers and stable across layers—thereby enabling affinity-aware token compression, lookahead expert prediction, and cache/pipeline orchestration. It claims up to 2.68× and 1.61× end-to-end inference speedups over strong baselines on representative VL-MoE models while maintaining competitive accuracy.

Significance. If the affinity effect is shown to be general rather than model- or task-specific and the speedups are attributable to the proposed mechanisms (rather than token reduction alone), the work would provide a useful extension of text-centric MoE offloading techniques to visual-heavy multimodal workloads, with potential impact on memory-constrained deployment of large VL models.

major comments (2)
  1. [Evaluation] The central claim attributes performance gains to the visual-expert affinity effect induced by token pruning. However, without ablations that isolate affinity-aware compression, lookahead prediction, and orchestration from simple token reduction (and without results across multiple VL-MoE variants differing in vision encoders or routing), it remains unclear whether the reported 2.68×/1.61× speedups are driven by the claimed insight or by reduced token count. The evaluation must quantify the affinity effect (e.g., via metrics on expert access concentration and stability) and test its robustness.
  2. [Abstract and Evaluation] The abstract states performance numbers and accuracy retention but supplies no experimental details, baselines, error bars, dataset sizes, or ablation results; the full manuscript must provide these to allow verification of the weakest assumption that pruning reliably reshapes expert demand into a smaller, stable working set exploitable under tight memory budgets.
minor comments (1)
  1. [Abstract] The abstract refers to 'multiple frameworks' and 'representative VL-MoE models and benchmarks' without naming them; this should be expanded for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to strengthen the evaluation and clarify experimental details.

read point-by-point responses
  1. Referee: [Evaluation] The central claim attributes performance gains to the visual-expert affinity effect induced by token pruning. However, without ablations that isolate affinity-aware compression, lookahead prediction, and orchestration from simple token reduction (and without results across multiple VL-MoE variants differing in vision encoders or routing), it remains unclear whether the reported 2.68×/1.61× speedups are driven by the claimed insight or by reduced token count. The evaluation must quantify the affinity effect (e.g., via metrics on expert access concentration and stability) and test its robustness.

    Authors: We appreciate the referee's emphasis on isolating the contributions of our proposed mechanisms. In the revised manuscript, we have added new ablation studies that directly compare affinity-aware compression, lookahead prediction, and orchestration against a controlled baseline applying identical token pruning but using standard offloading without affinity exploitation. These ablations show that the additional speedups (beyond token reduction) stem from improved expert locality. We have also introduced quantitative metrics for the affinity effect, including expert usage entropy (for within-layer concentration) and cross-layer expert overlap ratios (for stability). For robustness, we have extended the evaluation to an additional VL-MoE variant with a distinct vision encoder and routing configuration, confirming consistent affinity benefits. We believe these changes demonstrate that the reported gains are attributable to the visual-expert affinity insight. revision: yes

  2. Referee: [Abstract and Evaluation] The abstract states performance numbers and accuracy retention but supplies no experimental details, baselines, error bars, dataset sizes, or ablation results; the full manuscript must provide these to allow verification of the weakest assumption that pruning reliably reshapes expert demand into a smaller, stable working set exploitable under tight memory budgets.

    Authors: We agree that the abstract is intentionally concise. The full manuscript already details all experimental aspects in Sections 4 and 5, including specific baselines (e.g., existing MoE offloading systems and token-pruning-only variants), dataset sizes and splits, multiple-run error bars, and ablation results supporting the reshaping of expert demand. To improve verifiability, we have added a brief mention of key experimental conditions to the abstract and included a consolidated results table with error bars and dataset information in the main body. These revisions ensure the core assumption about pruning-induced affinity is fully supported and checkable. revision: partial

Circularity Check

0 steps flagged

No significant circularity; affinity presented as empirical observation

full rationale

The paper's central insight—that pruning visual tokens produces visual-expert affinity by concentrating and stabilizing expert accesses—is introduced as a systems observation rather than a mathematical derivation. No equations, fitted parameters, or self-referential loops appear in the provided text. The performance claims rest on the combination of compression, prediction, and orchestration mechanisms applied to this observed effect, without any step reducing by construction to its own inputs or to a self-citation chain. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that MoE routing in VL models is driven by token content and that visual tokens dominate access patterns; it introduces visual-expert affinity as an observed effect without independent falsifiable evidence beyond the system itself.

axioms (1)
  • domain assumption MoE layers route tokens to a subset of experts based on learned gating
    Standard assumption for all MoE models; invoked implicitly when discussing expert accesses.
invented entities (1)
  • visual-expert affinity no independent evidence
    purpose: To name the phenomenon that token pruning concentrates and stabilizes expert accesses
    New term introduced to explain why pruning helps offloading beyond simple compute reduction; no external validation provided.

pith-pipeline@v0.9.0 · 5528 in / 1333 out tokens · 58621 ms · 2026-05-09T16:06:57.449448+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 45 canonical work pages · 9 internal anchors

  1. [1]

    Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. 2025. DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models. arXiv:2503.02175 [cs.CV]https://arxiv.org/ abs/2503.02175

  2. [2]

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Min- jia Zhang, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale. InProceedings of the International Conference on High Perfor- mance Computing, Networki...

  3. [3]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Ze- sen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wen- bin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

  4. [4]

    Qwen3-VL Technical Report.arXiv preprint arXiv:2511.21631 (2025)

  5. [5]

    Hongtao Chen, Weiyu Xie, Boxin Zhang, Jingqi Tang, Jiahao Wang, Jianwei Dong, Shaoyuan Chen, Ziwei Yuan, Chen Lin, Chengyu Qiu, Yuening Zhu, Qingliang Ou, Jiaqi Liao, Xianglin Chen, Zhiyuan Ai, Yongwei Wu, and Mingxing Zhang. 2025. KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models. In Proceedings of the ACM SIGOPS 31s...

  6. [6]

    Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, and Aymen Shabou

  7. [7]

    PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models. InCVPR. 14582–14592. doi:10.1109/ CVPR52734.2025.01359

  8. [8]

    Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, and Dahua Lin. 2025. MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design. arXiv:2505.05799 [cs.LG]https://arxiv.org/abs/2505.05799

  9. [9]

    Artyom Eliseev and Denis Mazur. 2023. Fast Inference of Mixture-of- Experts Language Models with Offloading. arXiv:2312.17238 [cs.LG] https://arxiv.org/abs/2312.17238

  10. [10]

    Zehao Fan, Zhenyu Liu, Yunzhen Liu, Yayue Hou, Hadjer Ben- meziane, Kaoutar El Maghraoui, and Liu Liu. 2025. Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems. arXiv:2512.04476 [cs.LG]https://arxiv.org/abs/2512.04476

  11. [11]

    Othmane Friha, Mohamed Amine Ferrag, Burak Kantarci, Burak Cak- mak, Arda Ozgun, and Nassira Ghoualmi-Zine. 2024. LLM-Based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness.IEEE Open Journal of the Communica- tions Society5 (2024), 5799–5856. doi:10.1109/OJCOMS.2024.3456549

  12. [12]

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. 2025. MME: A Comprehen- sive Evaluation Benchmark for Multimodal Large Language Models. arXiv:2306.13394 [cs.CV]https://arxiv.org/abs/2306.13394

  13. [13]

    Georgi Gerganov. 2023. llama.cpp: Port of Facebook’s LLaMA model in C/C++.https://github.com/ggerganov/llama.cpp

  14. [14]

    Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan

  15. [15]

    Accelerate: Training and inference at scale made simple, efficient and adaptable.https://github.com/huggingface/accelerate

  16. [16]

    Duc Hoang, Ajay Jaiswal, Mohammad Samragh, and Minsik Cho. 2026. SpecMD: A Comprehensive Study On Speculative Expert Prefetching. arXiv:2602.03921 [cs.LG]https://arxiv.org/abs/2602.03921

  17. [17]

    Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong, Jinyang Guo, Xianglong Liu, and Jun Zhang. 2026. MoDES: Accel- erating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping. arXiv:2511.15690 [cs.CV]https://arxiv.org/ abs/2511.15690

  18. [18]

    Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, and Babak Taati. 2025. Similarity-Aware Token Pruning: Your VLM but Faster. arXiv:2503.11549 [cs.CV]https://arxiv.org/abs/2503.11549

  19. [19]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie- Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teve...

  20. [20]

    Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci

  21. [21]

    Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. arXiv:2402.07033 [cs.LG] https://arxiv.org/abs/2402.07033

  22. [22]

    Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. 2024. SwapMoE: Serving Off- the-shelf MoE-based Large Language Models with Tunable Memory Budget. arXiv:2308.15030 [cs.AI]https://arxiv.org/abs/2308.15030

  23. [23]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  24. [24]

    Donghyun Lee, Je-Yong Lee, Genghan Zhang, Mo Tiwari, and Aza- lia Mirhoseini. 2024. CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models. arXiv:2404.08763 [cs.LG]https: //arxiv.org/abs/2404.08763

  25. [25]

    Shuhuai Li, Jianghao Lin, Dongdong Ge, and Yinyu Ye. 2026. MoE- SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios. arXiv:2603.09983 [cs.LG]https: //arxiv.org/abs/2603.09983

  26. [26]

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, and Limin Wang. 2024. VideoChat-Flash: Hierarchi- cal Compression for Long-Context Video Modeling.arXiv preprint arXiv:2501.00574(2024)

  27. [27]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Evaluating Object Hallucination in Large Vision-Language Models. InThe 2023 Conference on Empirical Methods in Natural Language Processing.https://openreview.net/forum?id= xozJw0kZXF

  28. [28]

    Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, and Zhongyu Wei. 2026. Not All Models Suit Expert Offload- ing: On Local Routing Consistency of Mixture-of-Expert Models. arXiv:2505.16056 [cs.LG]https://arxiv.org/abs/2505.16056

  29. [29]

    Bin Lin, Zhenyu Tang, Yang Ye, Jinfa Huang, Junwu Zhang, Ya- tian Pang, Peng Jin, Munan Ning, Jiebo Luo, and Li Yuan. 2024. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models. arXiv:2401.15947 [cs.CV]https://arxiv.org/abs/2401.15947

  30. [30]

    Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, and Bin Chen

  31. [31]

    HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models. arXiv:2508.00553 [cs.CV] https://arxiv.org/abs/2508.00553

  32. [32]

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. 2024. MMBench: Is Your Multi-modal Model an All-around Player? arXiv:2307.06281 [cs.CV]https://arxiv.org/abs/ 2307.06281

  33. [33]

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chun- yuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai

  34. [34]

    OCRBench: on the hidden mystery of OCR in large multi- modal models.Science China Information Sciences67, 12 (Dec. 2024). doi:10.1007/s11432-024-4235-6

  35. [35]

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, and Beidi Chen. 2023. Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time. arXiv:2310.17157 [cs.LG]https://arxiv.org/ abs/2310.17157

  36. [36]

    NVIDIA Corporation. 2020. NVIDIA A100 Tensor Core GPU.https: //www.nvidia.com/en-us/data-center/a100/. Accessed: 2026-03-04

  37. [37]

    NVIDIA Corporation. 2024. NVIDIA Jetson Orin Architecture. https://www.nvidia.com/en-us/autonomous-machines/embedded- systems/jetson-orin/. Accessed: 2026-03-04

  38. [38]

    Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, Feng Xiao, and Lizhen Cui. 2025. Can Visual Input Be Compressed? A Vi- sual Token Compression Benchmark for Large Multimodal Models. arXiv:2511.02650 [cs.CV]https://arxiv.org/abs/2511.02650

  39. [39]

    Samsung Electronics. 2023. Samsung NVMe SSD 980 PRO Data Sheet (rev. 2.1 ed.). Samsung Electronics. https://download.semiconductor.samsung.com/resources/data-sheet/Samsung-NVMe-SSD-980-PRO-Data-Sheet_Rev.2.1_230509_10129505081019.pdf Accessed: March 2, 2026

  40. [40]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv:1701.06538 [cs.LG]https://arxiv.org/abs/1701.06538

  41. [41]

    Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. 2025. FastVID: Dy- namic Density Pruning for Fast Video Large Language Models. arXiv:2503.11187 [cs.CV]https://arxiv.org/abs/2503.11187

  42. [42]

    Zixu Shen, Kexin Chu, Yifan Zhang, Dawei Xiang, Runxin Wu, and Wei Zhang. 2025. ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference. arXiv:2510.26730 [cs.DC] https://arxiv.org/abs/2510.26730

  43. [43]

    Xiaoniu Song, Zihang Zhong, Rong Chen, and Haibo Chen. 2025. ProMoE: Fast MoE-based LLM Serving using Proactive Caching. arXiv:2410.22134 [cs.DC]https://arxiv.org/abs/2410.22134

  44. [44]

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles(Austin, TX, USA)(SOSP ’24). Association for Computing Machinery, New York, NY, USA, 590–606. doi:10.1145/3694715.3695964

  45. [45]

    Boyuan Sun, Jiaxing Zhao, Xihan Wei, and Qibin Hou. 2025. LLaVA- Scissor: Token Compression with Semantic Connected Components for Video LLMs.arXiv preprint arXiv:2506.21862(2025)

  46. [46]

    Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, and Minyi Guo. 2024. HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference. arXiv:2411.01433 [cs.LG]https://arxiv.org/abs/2411.01433

  47. [47]

    Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Philip S. Yu. 2023. Multimodal Large Language Models: A Survey. arXiv:2311.13165 [cs.AI]https://arxiv.org/abs/2311.13165

  48. [48]

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. 2024. DeepSeek-VL2: Mixture-of-Experts...

  49. [49]

    Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxi- ang Feng, and Xiaojie Wang. 2025. FastMMoE: Accelerating Mul- timodal Large Language Models through Dynamic Expert Activa- tion and Routing-Aware Token Pruning. arXiv:2511.17885 [cs.CV] https://arxiv.org/abs/2511.17885

  50. [50]

    Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, and Chao Li. 2025. MoE-Prism: Disentangling Mono- lithic Experts for Elastic MoE Services via Model-System Co-Designs. arXiv:2510.19366 [cs.CL]https://arxiv.org/abs/2510.19366

  51. [51]

    Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li, Tianhao Huang, Xi- aozhi Zhu, Mo Niu, Lingyu Sun, Peng Tang, Tongqiao Xu, Kwang-Ting Cheng, and Minyi Guo. 2023. MMBench: Benchmarking End-to-End Multi-modal DNNs and Understanding Their Hardware-Software Im- plications. In2023 IEEE International Symposium on Workload Charac- terization (IISWC). 154–166. doi:...

  52. [52]

    Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. 2025. MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache. arXiv:2401.14361 [cs.LG]https://arxiv. org/abs/2401.14361

  53. [53]

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. 2024. VisionZip: Longer is Better but Not Necessary in Vision Language Models.arXiv preprint arXiv:2412.04467 (2024)

  54. [54]

    Hanfei Yu, Xingqi Cui, Hong Zhang, Hao Wang, and Hao Wang. 2025. Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading. arXiv:2502.05370 [cs.LG]https:// arxiv.org/abs/2502.05370

  55. [55]

    Shaolei Zhang, Qingkai Fang, Zhe Yang, and Yang Feng. 2025. LLaVA- Mini: Efficient Image and Video Large Multimodal Models with One Vision Token. arXiv:2501.03895 [cs.CV]https://arxiv.org/abs/2501. 03895

  56. [56]

    Yuning Zhang, Grant Pinkert, Nan Yang, Yanli Li, and Dong Yuan

  57. [57]

    DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance. arXiv:2509.07379 [cs.DC] https://arxiv.org/abs/2509.07379

  58. [58]

    Yushu Zhao, Yubin Qin, Yang Wang, Xiaolong Yang, Huiming Han, Shaojun Wei, Yang Hu, and Shouyi Yin. 2026. MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts. In2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC). 999–1005. doi:10.1109/ASP-DAC66049.2026. 11420472

  59. [59]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs.AI] https://arxiv.org/abs/2312.07104

  60. [60]

    Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, and Meng Li. 2025. HybriMoE: Hybrid CPU-GPU Schedul- ing and Cache Management for Efficient MoE Inference. In2025 62nd ACM/IEEE Design Automation Conference (DAC). 1–7. doi:10.1109/ DAC63849.2025.11133274

  61. [61]

    Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, and Zheng-Jun Zha. 2025. VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs. arXiv:2510.16598 [cs.CV]https://arxiv.org/abs/2510.16598