pith. machine review for the scientific record.

arxiv: 2605.14310 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming video understanding · KV cache compression · coreset selection · vision-language models · diversity criterion · long video benchmarks · memory efficiency

The pith

Treating KV-cache compression as coreset selection improves streaming video understanding under fixed memory budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that token-wise heuristics such as recency or saliency discard too much of the accumulated visual history when compressing the key-value cache in vision-language models. It instead frames the problem as selecting a small coreset that jointly covers the geometry of the key and value spaces while adding an orthogonality term to promote diversity. Experiments across four open-source VLMs and five long-video benchmarks show gains over existing streaming baselines at the same cache size. A reader would care because the approach offers a more systematic way to keep memory usage low without losing the ability to answer future questions about the full video stream.

Core claim

By viewing KV-cache compression as a bicriteria coreset selection task in a joint key-value representation, the method balances coverage of retrieval structure and output-relevant information while using an orthogonality criterion, connected to log-determinant subset selection, to favor diverse directions and thereby retain a more representative subset than local pruning heuristics.

What carries the argument

The bicriteria objective that selects a coreset by balancing coverage in joint key-value space with an orthogonality-driven diversity term.
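
To make that principle concrete, the following is a minimal sketch of a greedy coverage-plus-diversity selector of the kind the claim describes, not the paper's Eq. 4/Eq. 9 implementation: the names alpha (key-versus-value balance) and lam (diversity strength) stand in for the free parameters listed in the ledger below, and cosine distance in each space is an assumption about the metric.

```python
import numpy as np

def select_coreset(keys, values, budget, alpha=0.5, lam=0.1):
    """Greedy sketch of coverage-plus-diversity KV-cache selection.
    keys, values: (n, d) arrays for the n cached tokens.
    alpha: key- vs. value-space balance; lam: diversity strength.
    Hypothetical parameter names, not the paper's notation."""
    K = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    V = values / (np.linalg.norm(values, axis=1, keepdims=True) + 1e-8)
    selected = [0]                                    # seed with the first cached token
    d_k = 1.0 - K @ K[0]                              # cosine gap to nearest kept key
    d_v = 1.0 - V @ V[0]                              # ... and to nearest kept value
    Q = K[:1].copy()                                  # orthonormal basis of kept keys
    while len(selected) < budget:
        coverage = alpha * d_k + (1.0 - alpha) * d_v  # bicriteria coverage gap
        resid = K - (K @ Q.T) @ Q                     # key component outside span(Q)
        diversity = np.linalg.norm(resid, axis=1)     # large residual = new direction
        score = coverage + lam * diversity
        score[selected] = -np.inf                     # never re-pick a kept token
        j = int(np.argmax(score))
        selected.append(j)
        d_k = np.minimum(d_k, 1.0 - K @ K[j])         # update nearest-kept gaps
        d_v = np.minimum(d_v, 1.0 - V @ V[j])
        r = resid[j]
        if np.linalg.norm(r) > 1e-6:                  # extend the basis for the diversity term
            Q = np.vstack([Q, r / np.linalg.norm(r)])
    return np.array(selected)
```

In this form the selection is farthest-point style in the joint space; whether CoRDS re-runs it from scratch on each arriving chunk or updates it incrementally is not recoverable from this page.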

If this is right

  • Accuracy rises on long-video and streaming benchmarks compared with recency, redundancy, or saliency baselines at fixed cache size.
  • The retained subset preserves both retrieval geometry and output-relevant signals better than independent token scoring.
  • Orthogonality favors new directions, reducing redundancy within the compressed cache (a numerical check of the log-determinant connection follows this list).
  • The same selection principle applies across multiple open-source vision-language models without task-specific retraining.
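
The numerical check promised above: for any ordered set of kept key vectors, the determinant of their Gram matrix equals the product of the squared Gram-Schmidt residual norms, so greedily keeping large-residual candidates is greedily growing log-det volume, which is the sense in which an orthogonality criterion connects to log-determinant subset selection. Illustrative only; the vectors are random, not model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 16))              # 5 hypothetical kept key vectors in 16 dims

# Gram-Schmidt residual norms: how much each vector adds beyond the earlier ones
resid_norms = []
Q = np.zeros((0, S.shape[1]))
for s in S:
    r = s - (Q @ s) @ Q                   # subtract projection onto the kept basis
    resid_norms.append(np.linalg.norm(r))
    Q = np.vstack([Q, r / np.linalg.norm(r)])

gram_logdet = np.linalg.slogdet(S @ S.T)[1]
print(np.isclose(gram_logdet, 2 * np.sum(np.log(resid_norms))))   # True
```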

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coverage-plus-diversity logic could be tested on long-context language models to compress their KV caches.
  • Adaptive cache budgets that grow or shrink based on detected scene complexity might further improve the method.
  • Links to log-determinant selection suggest possible connections to other matrix-based summarization tasks in machine learning.

Load-bearing premise

A small geometrically covering subset of tokens will retain the information needed for arbitrary future reasoning queries over the entire video history.

What would settle it

A controlled test on a long video where a query depends on visual content from a cluster of tokens that the coreset discards, producing a measurable drop in answer accuracy relative to the full cache.
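
One hedged way to locate such a probe is to rank cache tokens by cosine distance to their nearest retained token and treat the worst-covered cluster as the content the question should target; the same per-token distance is what Figures 4 and 8 aggregate into coverage CDFs. Names and the choice of cosine distance are assumptions, not the paper's protocol.

```python
import numpy as np

def least_covered_tokens(cache, retained_idx, k=50):
    """Rank cache tokens by cosine distance to their nearest retained token.
    cache: (n, d) token features (keys, values, or concatenated K||V);
    retained_idx: indices kept by the compression method under test.
    The top-k tokens are candidate probe content: a question whose answer
    depends on them should show the largest drop versus the full cache."""
    X = cache / (np.linalg.norm(cache, axis=1, keepdims=True) + 1e-8)
    nearest_gap = 1.0 - (X @ X[retained_idx].T).max(axis=1)
    return np.argsort(-nearest_gap)[:k], nearest_gap   # gaps also give the Fig. 4/8 CDFs
```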

Figures

Figures reproduced from arXiv: 2605.14310 by Ailar Mahdizadeh, Leonid Sigal, Muchen Li, Puria Azadi, Xiangteng He.

Figure 1: CoRDS framework. As streaming frames arrive, the accumulated KV cache is compressed by D2-style coreset selection in joint K-V space (Eq. 4) with an orthogonal anti-redundancy term (Eq. 9); the persistent cache is then used to answer asynchronous question queries.
Figure 2: Orthogonality intuition. (a) A near-span candidate has a small residual r_K and adds little new coverage, whereas (b) a large-residual candidate expands the span into a new region. Coverage-based selection has a subtle failure mode: two candidates with high dα scores may lie in nearly the same direction in feature space, so selecting both wastes budget while leaving other direc…
Figure 3: Per-task accuracy across compression ratios on Qwen2-VL-7B. Each radar compares methods across compression ratios; the dashed ring denotes Full KV.
Figure 4: Coverage CDF on MLVU anomaly recognition (Qwen2-VL-7B, |M|=6K). Curves further left indicate better full-cache coverage under joint K∥V, K-only, and V-only metrics; across all three metrics, CoRDS covers the full cache better than InfiniPot-V, with peak gaps of Δ = 0.136 (joint K∥V), 0.156 (K-only…
Figure 5: Backward qualitative examples on EPM/ASI: CoRDS (ours) recovers the clue laid down hundreds of…
Figure 6: Forward qualitative examples on REC/CRR: CoRDS (ours) accumulates the three javelin-throw occur…
Figure 7: Real-time qualitative examples on STU/ATR/ACR/OCR: with the clue at the ask frame, CoRDS (ours)…
Figure 8: Coverage CDF of nearest-retained cosine distance for eight MLVU tasks under three KV-compression…
read the original abstract

Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle, than token-wise pruning, for memory-constrained streaming video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes CoRDS, a method that frames KV-cache compression for streaming video understanding as a coreset selection problem. It introduces a bicriteria objective balancing coverage in joint key-value spaces with an orthogonality-driven diversity term connected to log-determinant subset selection, and reports empirical improvements over heuristic baselines across four open-source VLMs and five long-video/streaming benchmarks under fixed cache budgets.

Significance. If the central empirical claim holds after addressing the noted gaps, the work would demonstrate that geometry-aware coreset selection can outperform local token-wise heuristics for memory-constrained VLM inference on long videos, offering a principled alternative to recency or saliency pruning with potential impact on efficient streaming video reasoning systems.

major comments (3)
  1. [§3.2] §3.2 (bicriteria objective): the balance weight between key and value coverage and the diversity regularization strength are free parameters, yet no ablation or sensitivity analysis is reported on their effect on performance; this is load-bearing because the abstract and experiments attribute gains to the joint objective without showing robustness to these choices (a sketch of such a sweep appears below).
  2. [§4] §4 (experiments): results claim consistent improvements over baselines but provide no quantitative variance, standard deviations across runs, or statistical significance tests, undermining the ability to assess whether gains are reliable or could be explained by benchmark-specific tuning.
  3. [§3.3] §3.3 (diversity criterion and log-det connection): no approximation guarantee, regret bound, or worst-case analysis is supplied showing that the selected coreset retains tokens necessary for arbitrary future queries; the evaluation is confined to the five benchmark distributions, leaving open the risk that low-coverage but query-critical tokens are systematically dropped when the stream diverges from observed statistics.
minor comments (1)
  1. [§3] Notation for the joint KV representation and the orthogonality term could be clarified with an explicit equation reference in the method section to aid reproducibility.
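
A sensitivity sweep of the kind major comment 1 asks for could be as small as the grid below; evaluate is a hypothetical harness returning benchmark accuracy for one configuration, and alpha/lam mirror the ledger's free parameters rather than the paper's notation.

```python
import itertools

def sensitivity_sweep(evaluate, alphas=(0.25, 0.5, 0.75), lams=(0.0, 0.05, 0.1, 0.2)):
    """Hypothetical ablation grid over the key/value balance weight (alpha)
    and the diversity regularization strength (lam).
    evaluate(alpha, lam) -> benchmark accuracy; assumed to exist."""
    results = {}
    for alpha, lam in itertools.product(alphas, lams):
        results[(alpha, lam)] = evaluate(alpha=alpha, lam=lam)
    return results   # the spread across the grid shows robustness, or its absence
```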

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (bicriteria objective): the balance weight between key and value coverage and the diversity regularization strength are free parameters, yet no ablation or sensitivity analysis is reported on their effect on performance; this is load-bearing because the abstract and experiments attribute gains to the joint objective without showing robustness to these choices.

    Authors: We agree that sensitivity analysis is necessary to support the claims. In the revised manuscript we will add an ablation study varying the balance weight between key and value coverage and the diversity regularization strength across a range of values, reporting performance on the main benchmarks to demonstrate robustness of the reported gains. revision: yes

  2. Referee: [§4] §4 (experiments): results claim consistent improvements over baselines but provide no quantitative variance, standard deviations across runs, or statistical significance tests, undermining the ability to assess whether gains are reliable or could be explained by benchmark-specific tuning.

    Authors: We acknowledge the absence of variance reporting. Although the core selection procedure is deterministic, we will repeat experiments under varied stream orderings and report standard deviations together with statistical significance tests (e.g., paired Wilcoxon tests) in the updated tables to quantify reliability (a sketch of such a paired test appears after these responses). revision: yes

  3. Referee: [§3.3] §3.3 (diversity criterion and log-det connection): no approximation guarantee, regret bound, or worst-case analysis is supplied showing that the selected coreset retains tokens necessary for arbitrary future queries; the evaluation is confined to the five benchmark distributions, leaving open the risk that low-coverage but query-critical tokens are systematically dropped when the stream diverges from observed statistics.

    Authors: We note that worst-case guarantees for arbitrary future queries are difficult to obtain without strong distributional assumptions on the query stream. Our diversity term is connected to log-determinant subset selection, which inherits known submodular approximation properties in the static setting. In the revision we will expand §3.3 with an explicit discussion of these limitations, the empirical scope of the evaluation, and potential risks for out-of-distribution streams. revision: partial
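
For the paired test promised in response 2, the natural unit is per-video accuracy under a fixed cache budget. The numbers below are made-up placeholders; only the test itself (SciPy's paired Wilcoxon signed-rank) is real.

```python
import numpy as np
from scipy.stats import wilcoxon

# Made-up per-video accuracies for CoRDS vs. a baseline at the same cache budget,
# paired by video; real tables would come from the repeated runs described above.
cords    = np.array([0.71, 0.64, 0.80, 0.58, 0.69, 0.75, 0.62, 0.77])
baseline = np.array([0.66, 0.61, 0.78, 0.55, 0.70, 0.71, 0.60, 0.72])

stat, p = wilcoxon(cords, baseline)        # paired, two-sided signed-rank test
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.3f}")
```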

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines a new bicriteria coreset objective for KV-cache compression (coverage in joint key-value space plus orthogonality linked to log-det diversity) and reports empirical gains on five benchmarks across four VLMs. No step reduces a claimed prediction or result to a fitted parameter defined on the same data, nor does any load-bearing premise collapse to a self-citation, ansatz smuggled via prior work, or renaming of a known result. The central claim remains an empirical observation under fixed cache budgets rather than a self-referential derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that geometric coverage in embedding space approximates utility for future reasoning, plus a small number of tunable weights in the bicriteria objective.

free parameters (2)
  • bicriteria balance weight
    Controls relative importance of key-space versus value-space coverage; value not stated in abstract.
  • diversity regularization strength
    Scales the orthogonality term; value not stated in abstract.
axioms (1)
  • domain assumption: A small subset selected by coverage and diversity criteria will retain sufficient information for downstream reasoning over the full history.
    Invoked when claiming that coreset selection is superior to token-wise heuristics for preserving retrieval structure and output-relevant information.

pith-pipeline@v0.9.0 · 5536 in / 1240 out tokens · 107419 ms · 2026-05-15T02:24:37.272302+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 6 internal anchors

  1. [2]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  2. [3]

    Long context transfer from language to vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. Transactions on Machine Learning Research, 2025

  3. [4]

    Infinipot-v: Memory-constrained kv cache compression for streaming video understanding

    Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache compression for streaming video understanding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  4. [5]

    Streaming long video understanding with large language models

    Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, and Dahua Lin. Streaming long video understanding with large language models. In Advances in Neural Information Processing Systems, 2024. https://arxiv.org/abs/2405.16009

  5. [6]

    LongVU: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. LongVU: Spatiotemporal adaptive compression for long video-language understanding. In Forty-secon...

  6. [7]

    Streamingbench: Assessing the gap for mllms to achieve streaming video understanding

    Junming Lin, Zheng Fang, Chi Chen, Haoxuan Cheng, Zihao Wan, Fuwen Luo, Ziyue Wang, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12147–12151. IEEE, 2026

  7. [8]

    Flash-vstream: Efficient real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, and Xiaojie Jin. Flash-vstream: Efficient real-time understanding for long video streams. In Proceedings of the IEEE/CVF international conference on computer vision, pages 21059–21069, 2025

  8. [9]

    Training-free adaptive frame selection for video-language understanding, 2026

    Bhavika Suresh Devnani, Jitesh Jain, Humphrey Shi, and Judy Hoffman. Training-free adaptive frame selection for video-language understanding, 2026

  9. [10]

    Adaptive keyframe sampling for long video understanding

    Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29118–29128, 2025

  10. [11]

    Storm: Token-efficient long video understanding for multimodal llms

    Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, et al. Storm: Token-efficient long video understanding for multimodal llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5830–5841, 2025

  11. [12]

    Streaming video question-answering with in-context video kv-cache retrieval

    Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang, et al. Streaming video question-answering with in-context video kv-cache retrieval. In ICLR, 2025

  12. [13]

    Streammem: Query-agnostic kv cache memory for streaming video understanding

    Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, and Mengye Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding. arXiv preprint arXiv:2508.15717, 2025

  13. [14]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

  14. [15]

    LLaVA-NeXT: A strong zero-shot video understanding model

    Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. LLaVA-NeXT: A strong zero-shot video understanding model. https://llava-vl.github.io/blog/2024-04-30-llava-next-video/, April 2024

  15. [16]

    EgoSchema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

  16. [17]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024

  17. [18]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint...

  18. [19]

    Ovo-bench: How far is your video-llms from real-world online video understanding?

    Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025

  19. [20]

    Needle in a video haystack: A scalable synthetic evaluator for video mllms

    Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Weipeng Chen, and Jing Liu. Needle in a video haystack: A scalable synthetic evaluator for video mllms. In International Conference on Learning Representations, 2025

  20. [21]

    InternVideo2.5: Empowering video MLLMs with long and rich context modeling

    Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, and Limin Wang. InternVideo2.5: Empowering video MLLMs with long and rich context modeling. arXiv:2501.12386, 2025

  21. [22]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv:2406.07476, 2024

  22. [23]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv:2311.10122, 2023

  23. [24]

    Efficient video sampling: Pruning temporally redundant tokens for faster vlm inference

    Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, et al. Efficient video sampling: Pruning temporally redundant tokens for faster vlm inference. arXiv preprint arXiv:2510.14624, 2025

  24. [25]

    Divprune: Diversity-based visual token pruning for large multimodal models

    Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. Divprune: Diversity-based visual token pruning for large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9392–9401, 2025

  25. [26]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023

  26. [27]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025

  27. [28]

    StreamingVLM: Real-time understanding for infinite video streams

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. StreamingVLM: Real-time understanding for infinite video streams. In The Fourteenth International Conference on Learning Representations, 2026

  28. [29]

    k-means++: The advantages of careful seeding

    David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007

  29. [30]

    Dycoke: Dynamic compression of tokens for fast video large language models

    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18992–19001, 2025

  30. [31]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  31. [32]

    An analysis of approximations for maximizing submodular set functions

    George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978

  32. [33]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024