SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
Llava-scissor: Token compression with semantic con- nected components for video llms
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
background 1polarities
background 1representative citing papers
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
LinMU achieves linear-complexity multimodal understanding by swapping self-attention for an M-MATE dual-branch block and distilling from a frozen teacher VLM, matching accuracy with up to 2.7x faster TTFT and 9x higher throughput.
StateKV is an inference-time technique that replaces quadratic self-attention prefill in video VLMs with a fixed-capacity importance-based recurrent state, keeping accuracy near full attention on long-video benchmarks without retraining.
citing papers explorer
-
LinMU: Multimodal Understanding Made Linear
LinMU achieves linear-complexity multimodal understanding by swapping self-attention for an M-MATE dual-branch block and distilling from a frozen teacher VLM, matching accuracy with up to 2.7x faster TTFT and 9x higher throughput.