LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
Pith reviewed 2026-05-20 14:49 UTC · model grok-4.3
The pith
Compact vision encoders trained to mimic compressed teacher outputs let Video LLMs process eight times more frames at reduced latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LiteFrame is a compact yet effective video encoder backbone for Video LLMs. It is trained via Compressed Token Distillation, a framework in which the student directly predicts the information-dense, spatio-temporally compressed representations generated by a large teacher vision model and thereby avoids redundant computation. When combined with Language Model Adaptation, the approach establishes a new latency-accuracy Pareto frontier: relative to InternVL3-8B it delivers a 35 percent reduction in end-to-end latency, supports eight times more frames, and raises average accuracy across video understanding benchmarks.
What carries the argument
Compressed Token Distillation (CTD), a training procedure that has a compact student vision encoder learn to output the same spatio-temporally compressed representations produced by a larger teacher model, thereby cutting per-frame encoder cost.
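To make the training target concrete, here is a minimal sketch of the idea in plain Python. It is not the paper's implementation: this page does not specify the compression operator or the loss, so average pooling over consecutive tokens (`average_pool`) and mean squared error (`distillation_mse`) are stand-ins chosen only for illustration, and both function names are hypothetical.

```python
from typing import List, Sequence

Token = Sequence[float]


def average_pool(tokens: List[Token], group: int) -> List[List[float]]:
    """Stand-in compressor (an assumption): average every `group` consecutive tokens."""
    if group <= 0 or not tokens or len(tokens) % group != 0:
        raise ValueError("token count must be a positive multiple of group")
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        pooled.append([sum(t[d] for t in chunk) / group for d in range(dim)])
    return pooled


def distillation_mse(student: List[Token], target: List[Token]) -> float:
    """Mean squared error between student outputs and compressed teacher targets."""
    if len(student) != len(target):
        raise ValueError("student must emit one token per compressed teacher token")
    total, count = 0.0, 0
    for s, t in zip(student, target):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count


# Toy values, made up for illustration: four teacher tokens compressed to two.
teacher_tokens = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0], [2.0, 0.0]]
target = average_pool(teacher_tokens, group=2)   # [[2.0, 4.0], [1.0, 1.0]]
student_output = [[2.0, 4.0], [1.0, 2.0]]
print(distillation_mse(student_output, target))  # 0.25
```

The point of the sketch is the shape of the objective: the student is scored only against the already-compressed targets, so it never has to produce the full uncompressed token grid.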
If this is right
- Video LLMs can ingest eight times more frames while staying inside the same overall compute envelope.
- End-to-end latency falls by 35 percent compared with InternVL3-8B at the new operating point.
- Average accuracy rises across standard video understanding benchmarks.
- The dominant latency source moves from vision encoding to later stages of the pipeline.
- A new latency-accuracy frontier appears for long-form video tasks under fixed budgets.
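The frontier claim in the last bullet can be checked mechanically once latency and accuracy are measured for each configuration. The sketch below shows one way to extract a latency-accuracy Pareto frontier. The `OperatingPoint` type, the function name, and every number in the example are hypothetical placeholders, not results from the paper.

```python
from typing import List, NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    latency_s: float  # lower is better
    accuracy: float   # higher is better


def pareto_frontier(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep the points that no other point beats on both latency and accuracy."""
    # Sort by latency; among equal latencies, put the most accurate first.
    ordered = sorted(points, key=lambda p: (p.latency_s, -p.accuracy))
    frontier: List[OperatingPoint] = []
    best_accuracy = float("-inf")
    for p in ordered:
        # A slower point survives only if it is strictly more accurate
        # than every faster point seen so far.
        if p.accuracy > best_accuracy:
            frontier.append(p)
            best_accuracy = p.accuracy
    return frontier


# Placeholder measurements, invented purely to exercise the function.
points = [
    OperatingPoint("a", 1.0, 60.0),
    OperatingPoint("b", 2.0, 58.0),  # dominated by "a": slower and less accurate
    OperatingPoint("c", 3.0, 65.0),
    OperatingPoint("d", 0.5, 50.0),
]
print([p.name for p in pareto_frontier(points)])  # ['d', 'a', 'c']
```

A configuration establishes a new frontier in this sense when adding it to the set of baselines removes at least one previously undominated point or extends the frontier into a region no baseline reached.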
Where Pith is reading between the lines
- The same student-teacher compression pattern could be tested on audio or multimodal streams that share similar per-frame costs.
- Longer feasible video lengths may improve performance on tasks that require extended temporal context, such as narrative understanding.
- Varying the teacher model or the degree of spatio-temporal compression could reveal task-specific optimal trade-offs.
- Combining LiteFrame with existing post-hoc token reduction methods might produce additive gains in both speed and length.
Load-bearing premise
A compact student encoder can faithfully recover the spatio-temporally compressed representations of the large teacher model without discarding information the downstream language model needs for accurate video understanding.
What would settle it
Measuring end-to-end video understanding accuracy on long clips and finding that LiteFrame with eight times more frames yields lower scores than the original teacher encoder with the original frame count, after equalizing total compute.
Original abstract
The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier -- compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8× more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LiteFrame, a compact vision encoder for Video LLMs trained via Compressed Token Distillation (CTD) to predict spatio-temporally compressed representations from a larger teacher model. Combined with Language Model Adaptation (LMA), it claims to establish a new latency-accuracy Pareto frontier, delivering 35% lower end-to-end latency while processing 8× more frames and higher average accuracy on video understanding benchmarks relative to InternVL3-8B.
Significance. If the empirical results hold under detailed validation, the work could meaningfully advance long-video scaling in LLMs by targeting the vision-encoder bottleneck through distillation of compressed tokens rather than post-hoc LLM-side reduction, potentially enabling longer contexts under fixed compute.
major comments (2)
- Abstract: The central claim of a new Pareto frontier (35% latency reduction, 8× frames, improved accuracy) is presented without ablations, error bars, dataset specifics, or quantitative fidelity metrics for the CTD student outputs, leaving the robustness of the accuracy gains unassessable from the reported text.
- CTD training framework: The assumption that the compact student faithfully reconstructs task-critical spatio-temporal details (fine-grained motion, object relations, temporal ordering) required by the LLM after LMA lacks supporting evidence such as representation similarity scores, per-task ablations, or error analysis, which directly bears on whether the reported accuracy improvements are reliable.
minor comments (1)
- Abstract: Clarify the exact compression ratio and distillation loss weights used in CTD, as these are listed among the free parameters but not numerically specified in the high-level description.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments highlight opportunities to improve the clarity and robustness of our presentation. We address each major comment point by point below, providing additional context from the full manuscript while proposing targeted revisions to strengthen the work without altering its core contributions.
- Referee: Abstract: The central claim of a new Pareto frontier (35% latency reduction, 8× frames, improved accuracy) is presented without ablations, error bars, dataset specifics, or quantitative fidelity metrics for the CTD student outputs, leaving the robustness of the accuracy gains unassessable from the reported text.
  Authors: We agree that the abstract, due to its brevity, does not enumerate all supporting details. The full manuscript provides these in Section 4 (Experiments), including dataset specifications (Video-MME, EgoSchema, MLVU, and others), ablation studies on frame scaling and latency components, and error bars (mean ± std over 3 seeds) in Tables 2–5. Quantitative fidelity of CTD outputs is indirectly validated via end-to-end accuracy but we will add explicit metrics. In revision we will expand the abstract with a single sentence referencing the supporting experimental sections and include a new table or paragraph reporting CTD fidelity (e.g., token reconstruction MSE and feature similarity) to make the robustness immediately verifiable from the high-level claims.
  revision: partial
- Referee: CTD training framework: The assumption that the compact student faithfully reconstructs task-critical spatio-temporal details (fine-grained motion, object relations, temporal ordering) required by the LLM after LMA lacks supporting evidence such as representation similarity scores, per-task ablations, or error analysis, which directly bears on whether the reported accuracy improvements are reliable.
  Authors: The manuscript validates the student’s utility primarily through downstream accuracy gains on benchmarks that explicitly test the cited capabilities (temporal ordering in MLVU, fine-grained motion in EgoSchema, object relations in Video-MME). These end-to-end results serve as the strongest indicator that critical spatio-temporal information is preserved after LMA. Nevertheless, we recognize the value of direct evidence. In the revised manuscript we will add (i) representation similarity scores (CKA and cosine similarity) between LiteFrame and teacher features on held-out video clips, (ii) per-task ablations isolating motion- and relation-heavy subsets, and (iii) a brief error analysis of cases where accuracy drops, thereby directly addressing the concern about reconstruction fidelity.
  revision: yes
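The fidelity metrics promised in the rebuttal are simple to compute once student and teacher features are row-aligned (one row per token or clip). The sketch below is an illustration in plain Python, not the authors' evaluation code. It assumes the linear variant of CKA, which the rebuttal does not specify, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are samples (tokens or clips), columns are features


def center_columns(x: Matrix) -> Matrix:
    """Subtract each feature's mean so every column sums to zero."""
    n, d = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in x]


def cross_gram_sq_norm(x: Matrix, y: Matrix) -> float:
    """Squared Frobenius norm of X^T Y for row-aligned matrices."""
    total = 0.0
    for i in range(len(x[0])):
        for j in range(len(y[0])):
            dot = sum(xr[i] * yr[j] for xr, yr in zip(x, y))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    if len(x) != len(y):
        raise ValueError("x and y need the same number of rows")
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    if denom == 0.0:
        raise ValueError("features are constant; CKA is undefined")
    return cross_gram_sq_norm(xc, yc) / denom


def mean_cosine_similarity(x: Matrix, y: Matrix) -> float:
    """Average cosine similarity between matching rows (requires equal widths)."""
    if len(x) != len(y):
        raise ValueError("x and y need the same number of rows")
    sims = []
    for xr, yr in zip(x, y):
        nx = math.sqrt(sum(a * a for a in xr))
        ny = math.sqrt(sum(b * b for b in yr))
        if nx == 0.0 or ny == 0.0:
            raise ValueError("zero-norm row; cosine similarity is undefined")
        sims.append(sum(a * b for a, b in zip(xr, yr)) / (nx * ny))
    return sum(sims) / len(sims)
```

The two metrics answer different questions. Cosine similarity needs student and teacher features to live in the same space with the same width, whereas linear CKA compares the geometry of the two feature sets and is unchanged by uniform rescaling or rotation of either one, so it remains usable when the student's hidden size differs from the teacher's.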
Circularity Check
No circularity: empirical claims rest on external benchmarks
The paper proposes Compressed Token Distillation (CTD) and Language Model Adaptation (LMA) as training procedures for a compact student encoder, then reports empirical latency and accuracy results against the external baseline InternVL3-8B. No equations, fitted parameters, or self-citations are shown that reduce the reported 35% latency reduction or 8× frame scaling to quantities defined by the authors' own inputs. The central Pareto-frontier claim is therefore self-contained against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- distillation loss weights and compression ratio
Reference graph
Works this paper leans on
- [1] S. Bai et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923,
- [2] L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. ShareGPT4Video: Improving video understanding and generation with better captions. In European Conference on Computer Vision (ECCV), 2024a. L. Chen, H. Zhao, T. Liu, S. Bai, J. Lin, C. Zhou, and B. Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration ...
- [3] C. Fu, H. Yuan, Y. Dong, Y.-F. Zhang, Y. Shen, X. Hu, X. Li, J. Su, C. Long, X. Xie, Y. Xie, X. Zheng, X. Yang, H. Cao, Y. Wu, Z. Liu, X. Sun, C. Shan, and R. He. Video-MME-v2: Towards the next stage in benchmarks for comprehensive video understanding. arXiv preprint arXiv:2604.05015,
- [4] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a. K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luo, L. Wang, and Y. Qiao. MVBench: ...
- [5] C. Liao, H. Tan, Z. Wang, Q. Liu, X. Zhang, W. Meng, Z. Wang, Y. Liu, K. Wang, Y. Liu, et al. Are we using the right benchmark: An evaluation framework for visual token compression methods. arXiv preprint arXiv:2510.07143,
- [6] K. Nan, R. Xie, P. Zhou, T. Fan, Z. Yang, Z. Chen, and Y. Tai. OpenVid-1M: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371,
- [7]
- [8]
- [9] X. Shen, Y. Xiong, C. Zhao, L. Wu, J. Chen, C. Zhu, Z. Liu, F. Xiao, B. Varadarajan, F. Bordes, et al. LongVU: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434,
- [10]
- [11] W. Wang, Z. He, W. Hong, Y. Cheng, X. Zhang, J. Qi, X. Gu, S. Huang, B. Xu, Y. Dong, M. Ding, and J. Tang. LVBench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035,
- [12] Y. Wang, X. Li, Z. Yan, Y. He, J. Yu, X. Zeng, C. Wang, C. Ma, H. Huang, J. Gao, et al. InternVideo2.5: Empowering video MLLMs with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025a. Z. Wang, S. Purushwalkam, C. Xiong, S. Savarese, H. Ji, and R. Xu. DyMU: Dynami...
- [13] Y. Xu et al. FineVideo: A fine-grained dataset for video understanding. arXiv preprint arXiv:2405.00000,
- [14] Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li. LLaVA-Video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713,
- [15] J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, Y. Duan, H. Tian, W. Su, J. Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479,
- [16] This ensures that the total visual token volume matches that of the teacher’s typical input (equivalent to 8–32 frames for the uncompressed teacher). We perform LMA on 8× NVIDIA H100 GPUs for 25K steps, which completes in a few hours. Datasets. Our training pipeline utilizes a subset of the video data described in the InternVL2.5 paper (Chen et al., 2024c). To be s...
- [17] to ensure robust visual-textual alignment. The datasets used in our work adhere to their respective licenses: ShareGPT4Video (CC-BY-NC-4.0), FineVideo (CC-BY), OpenVid-1M (CC-BY 4.0), and LLaVA-Video-178K (Apache License 2.0). Note that CLEVRER and NTU RGB+D are exclusively restricted to non-commercial, academic research purposes. Benchmarks. We employ thre...
- [18] Furthermore, we report the latency-accuracy trade-offs on short video benchmarks, such as MVBench (Li et al., 2024b) and TVBench (Cores et al., 2024), as well as additional long video benchmarks, including LVBench (Wang et al.,
- [19] and MMBench-Video (Fang et al., 2024), in Section B. The datasets evaluated in this work strictly adhere to their respective licenses: MVBench (MIT), HLVid (Apache 2.0), TVBench and MMBench-Video (CC-BY-4.0), and LongVideoBench, MLVU, and LVBench (CC-BY-NC-SA-4.0). Note that Video-MME is exclusively restricted to non-commercial, academic research purposes...
- [20] Latency. Latency is measured end-to-end including ViT processing and LLM prefilling. We focus exclusively on the visual token encoding and its prefilling stage, as these constitute the primary bottleneck addressed by our contributions. We report the median latency over 100 iterations, following a warmup phase of 40 iterations (140 iterations total), measured on a sin...
- [21] and MMBench-Video (Fang et al., 2024). Notably, on LVBench, LiteFrame utilizing 512-frame input achieves a superior score of 43.9 compared to the 64-frame baseline (43.5) while operating 38% faster, successfully leveraging the extended temporal context. On MMBench-Video, a free-form QA benchmark, LiteFrame demonstrates improved efficiency, particularly wi...
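The latency protocol quoted in entry [20] (a median over 100 timed iterations after 40 warmup iterations) can be sketched as a small timing harness. This is an illustration of the protocol only, not the authors' benchmarking code. `median_latency` and `run_once` are hypothetical names, and the callable is assumed to block until the encoder and prefill work has actually finished, which on an asynchronous accelerator usually requires an explicit device synchronization inside it.

```python
import statistics
import time
from typing import Callable, List


def median_latency(
    run_once: Callable[[], object],
    warmup: int = 40,     # untimed iterations, as in the quoted protocol
    measured: int = 100,  # timed iterations
) -> float:
    """Return the median wall-clock time in seconds of `run_once`."""
    for _ in range(warmup):
        run_once()  # results discarded; lets caches and lazy initialization settle

    samples: List[float] = []
    for _ in range(measured):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere; replace with the real
    # vision-encoder-plus-prefill call when measuring a model.
    print(median_latency(lambda: sum(range(10_000))))
```

Reporting the median rather than the mean keeps a few slow outlier iterations from skewing the figure, which matters when comparing configurations whose latencies differ by tens of percent.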