arXiv preprint arXiv:2507.20198

Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang · 2025 · arXiv 2507.20198

14 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Memory Retrieval in Visuomotor Policies for Long-Horizon Robot Control

cs.RO · 2026-06-23 · unverdicted · novelty 7.0

HALO distills VLM priors via question-answering objectives and applies sparse attention to enable reliable memory retrieval from up to eight minutes of history in imitation-learned visuomotor policies.

Very Efficient Listwise Multimodal Reranking for Long Documents

cs.IR · 2026-05-12 · unverdicted · novelty 7.0

ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

AVOC is a retrieval-inspired token compression framework that improves long-form audio-video understanding in multimodal LLMs by selecting informative tokens based on classical IR principles.

Learnable Token Sparsification for Efficient Gigapixel Whole Slide Image Reasoning

cs.CV · 2026-06-07 · unverdicted · novelty 6.0

Learnable sparsification framework compresses WSI visual tokens to 32 (0.78% of original) via SparseLearn, achieving 73.32% accuracy on SlideBench (TCGA) and outperforming baselines.

dMoE: dLLMs with Learnable Block Experts

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.

EarlyTom: Early Token Compression Completes Fast Video Understanding

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

EarlyTom is a training-free early token compression method inside the vision encoder with decoupled spatial selection that reduces TTFT up to 2.65x and FLOPs 61% on LLaVA-OneVision-7B while keeping accuracy comparable to full tokens.

O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

O-MARC is a compression distillation framework that lets compact omnimodal models maintain or exceed full-token performance on video QA while cutting latency and memory by about 35%.

OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

cs.CV · 2025-11-18 · conditional · novelty 6.0

OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.

MVPruner: Dynamic Token Pruning for Accelerating Multi-view Vision-Language Models in Autonomous Driving

cs.CV · 2026-06-26 · unverdicted · novelty 5.0 · 2 refs

MVPruner is a two-stage adaptive token pruning technique for multi-view VLMs that achieves 87.3% FLOPs reduction and 4.97x prefilling speedup while retaining 98.5% accuracy on DriveLM.

Linear Scaling Video VLMs for Long Video Understanding

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

StateKV is an inference-time technique that replaces quadratic self-attention prefill in video VLMs with a fixed-capacity importance-based recurrent state, keeping accuracy near full attention on long-video benchmarks without retraining.

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 5.0 · 2 refs

TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

cs.AI · 2026-05-12 · unverdicted · novelty 5.0

OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

cs.CV · 2026-05-10 · unverdicted · novelty 5.0

Fre-Res compresses video tokens by preserving spatial anchors and representing temporal dynamics with low-frequency residual tokens derived from 1D-DCT on inter-frame residuals, plus a Spatial-Guided Absorber to reinject the information.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

citing papers explorer

Memory Retrieval in Visuomotor Policies for Long-Horizon Robot Control · cs.RO · 2026-06-23 · unverdicted · none · ref 35
Very Efficient Listwise Multimodal Reranking for Long Documents · cs.IR · 2026-05-12 · unverdicted · none · ref 29
AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression · cs.CL · 2026-06-23 · unverdicted · none · ref 30
Learnable Token Sparsification for Efficient Gigapixel Whole Slide Image Reasoning · cs.CV · 2026-06-07 · unverdicted · none · ref 46
dMoE: dLLMs with Learnable Block Experts · cs.CL · 2026-05-29 · unverdicted · none · ref 71
EarlyTom: Early Token Compression Completes Fast Video Understanding · cs.CV · 2026-05-28 · unverdicted · none · ref 33
O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding · cs.CV · 2026-05-26 · unverdicted · none · ref 23
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models · cs.CV · 2025-11-18 · conditional · none · ref 34
MVPruner: Dynamic Token Pruning for Accelerating Multi-view Vision-Language Models in Autonomous Driving · cs.CV · 2026-06-26 · unverdicted · none · ref 25 · 2 links
Linear Scaling Video VLMs for Long Video Understanding · cs.CV · 2026-05-29 · unverdicted · none · ref 53
Temporal Aware Pruning for Efficient Diffusion-based Video Generation · cs.CV · 2026-05-18 · unverdicted · none · ref 167 · 2 links
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models · cs.AI · 2026-05-12 · unverdicted · none · ref 38
Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs · cs.CV · 2026-05-10 · unverdicted · none · ref 21
Toward Native Multimodal Modeling: A Roadmap · cs.CV · 2026-05-25 · unverdicted · none · ref 200