
arxiv: 2605.22678 · v1 · pith:JI5HTFUHnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Swift Sampling: Selecting Temporal Surprises via Taylor Series

Pith reviewed 2026-05-22 05:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords frame selection · temporal surprise · Taylor expansion · long video understanding · query-agnostic sampling · video QA · predictive coding

The pith

Swift Sampling selects high-information frames by detecting deviations from a Taylor-predicted trajectory in visual latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training-free algorithm that treats a video as a differentiable trajectory of features in latent space. It computes velocity and acceleration, then applies low-order Taylor expansion to forecast the expected path of later frames. Frames whose actual features diverge sharply from this forecast are labeled temporal surprises and retained for sampling. The method adds negligible compute cost yet raises accuracy on long-video question answering and related tasks, with the largest gains when only a small number of frames can be kept.
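
The scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `taylor_surprise`, the unit time step, the use of backward finite differences for velocity and acceleration, and the Euclidean norm for the residual are all assumptions made for this sketch.

```python
import numpy as np

def taylor_surprise(features: np.ndarray) -> np.ndarray:
    """Per-frame residual between actual features and a second-order Taylor forecast.

    features: array of shape (T, D), one feature vector per frame.
    Returns an array of shape (T,). The first three entries stay 0 because
    the forecast needs three preceding frames.
    """
    x = np.asarray(features, dtype=np.float64)
    T = x.shape[0]
    scores = np.zeros(T)
    if T < 4:
        return scores
    # Assumption: backward finite differences stand in for derivatives (unit time step).
    vel = x[2:-1] - x[1:-2]                  # v_t = x_t - x_{t-1}
    acc = x[2:-1] - 2.0 * x[1:-2] + x[:-3]   # a_t = x_t - 2 x_{t-1} + x_{t-2}
    pred = x[2:-1] + vel + 0.5 * acc         # forecast of x_{t+1}
    scores[3:] = np.linalg.norm(x[3:] - pred, axis=1)
    return scores
```

Under these assumptions a linear trajectory scores zero everywhere. A step change of size c in an otherwise linear trajectory produces residuals of c, 1.5c, and 0.5c on the jump frame and the two frames after it, because the jump passes through the difference stencil. A selection rule that keeps only one frame per neighborhood avoids spending the budget on those echoes.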

Core claim

By modeling video features as a smooth trajectory and using Taylor expansion to approximate the future evolution, frames that deviate strongly from the predicted manifold can be identified as temporally surprising and selected, yielding better downstream performance than uniform or prior query-agnostic sampling especially under tight frame budgets.

What carries the argument

Low-order Taylor expansion of feature velocity and acceleration to predict subsequent frames and flag large deviations from the expected manifold.
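
One plausible discretization of this mechanism is written out below. The symbols are introduced here for illustration and are not taken from the paper: x_t is the feature vector of frame t, the time step is assumed to be one frame, and backward differences are assumed as the estimates of velocity v_t and acceleration a_t.

```latex
\begin{aligned}
v_t &= x_t - x_{t-1}, \qquad a_t = x_t - 2x_{t-1} + x_{t-2}, \\
\hat{x}_{t+1} &= x_t + v_t + \tfrac{1}{2}\,a_t
              = \tfrac{5}{2}\,x_t - 2\,x_{t-1} + \tfrac{1}{2}\,x_{t-2}, \\
r_{t+1} &= \lVert x_{t+1} - \hat{x}_{t+1} \rVert_2 .
\end{aligned}
```

Frames with a large residual r_{t+1} are the candidates for temporal surprises under this reading.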

If this is right

  • Raises accuracy by up to 12.5 points over uniform sampling on long-video QA benchmarks.
  • Adds only 0.02 times the baseline compute cost.
  • Outperforms prior training-free methods across ten downstream tasks without video-specific tuning.
  • Delivers the largest relative benefit when frame budgets are severely limited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same deviation-from-prediction logic could be tested on audio or text sequences for efficient long-sequence processing.
  • Higher-order terms in the expansion might capture more complex motions without added training.
  • Adding task-specific query signals to the surprise score could further improve selection for particular questions.

Load-bearing premise

That large deviations from the low-order Taylor-predicted path in visual latent space correspond to the most informative frames.

What would settle it

Measure whether accuracy gains disappear on videos whose feature trajectories change abruptly and cannot be approximated well by a low-order polynomial.
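
One hypothetical way to bucket videos for such a test is to score how well a low-order forecast fits each feature trajectory. The statistic below, `taylor_fit_ratio`, is an assumption of this sketch and does not come from the paper: it compares the median second-order residual with the median frame-to-frame step, so a value near or above 1 suggests the forecast is no better than repeating the previous frame.

```python
import numpy as np

def taylor_fit_ratio(features: np.ndarray) -> float:
    """Median Taylor residual divided by median frame-to-frame step size.

    Values well below 1 suggest a second-order forecast tracks the trajectory.
    Values near or above 1 suggest a low-order polynomial describes it poorly.
    """
    x = np.asarray(features, dtype=np.float64)
    if x.shape[0] < 4:
        raise ValueError("need at least 4 frames")
    # Second-order forecast built from backward differences (assumed discretization).
    pred = 2.5 * x[2:-1] - 2.0 * x[1:-2] + 0.5 * x[:-3]
    resid = np.linalg.norm(x[3:] - pred, axis=1)
    step = np.linalg.norm(x[3:] - x[2:-1], axis=1)
    # Small constant guards against division by zero on static videos.
    return float(np.median(resid) / (np.median(step) + 1e-12))
```

Accuracy gains could then be reported separately for videos with low and high ratios to see whether the benefit tracks how polynomial-like the trajectory is.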

Figures

Figures reproduced from arXiv: 2605.22678 by Bhuvan Sachdeva, Dahye Kim, Deepti Ghadiyaram, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian.

Figure 1: Swift Sampling efficiently identifies temporal surprises in videos by measuring how much a frame deviates from the trajectory predicted by its preceding context. Using a Taylor expansion of visual features, we select frames with the largest residuals within their temporal neighborhood as keyframes. Top: Temporal surprise captured using Taylor residual over time. Bottom: input frames and frames selected by …
Figure 2: Each frame is represented on the latent feature trajectory, where we apply Taylor expansion over preceding frames to predict the next frame feature. The residual between the prediction and the actual feature measures how much the trajectory deviates from a smooth continuation. Frames with large residuals correspond to temporal surprises, e.g., seal suddenly emerging from the ice, which Swift Sampling eff…
Figure 4: Qualitative comparison of frame selection on a sample video from the Video-MME dataset, given a budget to select 8 frames out of 128. Answering the question requires identifying the temporal order of several visually similar but semantically distinct painting events: establishing the background, drawing the water-lily pads, adding flowers, and increasing texture. Uniform sampling captures the background an…
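
The captions mention two selection rules: uniform sampling, and keeping the frames with the largest residuals within their temporal neighborhood under a fixed budget such as 8 of 128. A rough sketch of both follows. The greedy suppression scheme and the `radius` parameter are hypothetical choices for this illustration, since the captions do not specify how neighborhoods are defined.

```python
import numpy as np

def uniform_indices(num_frames: int, budget: int) -> np.ndarray:
    """Evenly spaced frame indices, the baseline the figure compares against."""
    return np.linspace(0, num_frames - 1, budget).round().astype(int)

def select_surprises(scores: np.ndarray, budget: int, radius: int = 4) -> np.ndarray:
    """Greedily keep the highest-scoring frames, at most one per neighborhood.

    radius is a hypothetical neighborhood half-width: once a frame is kept,
    every frame within `radius` of it is skipped.
    """
    scores = np.asarray(scores, dtype=np.float64)
    n = scores.shape[0]
    blocked = np.zeros(n, dtype=bool)
    chosen = []
    # Visit frames from highest to lowest score.
    for idx in np.argsort(-scores, kind="stable"):
        if len(chosen) == budget:
            break
        if blocked[idx]:
            continue
        chosen.append(int(idx))
        lo, hi = max(0, idx - radius), min(n, idx + radius + 1)
        blocked[lo:hi] = True
    # Return indices in temporal order so downstream models see frames in sequence.
    return np.array(sorted(chosen), dtype=int)
```

With a large `radius` and a large budget, this sketch can return fewer frames than requested, because every remaining frame may already be blocked.
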
Original abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is incredibly lightweight, adding only 0.02x additional computational cost over baseline making it 30x cheaper overhead than leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets improving accuracy by up to +12.5 points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Swift Sampling, a training-free frame selection algorithm for long videos. It models a video as a differentiable trajectory in visual latent space, computes velocity and acceleration of features, applies low-order Taylor expansion to predict the expected path of subsequent frames, and selects frames that diverge sharply from this predicted manifold as temporally surprising high-information frames. The paper claims this outperforms uniform sampling and prior query-agnostic baselines across three long-video question answering benchmarks and 10 downstream tasks, with accuracy gains up to +12.5 points especially under limited frame budgets for long videos, while incurring only 0.02x additional computational cost.

Significance. If the empirical results prove robust and the modeling assumptions hold under scrutiny, the approach could provide a lightweight, query-agnostic alternative for efficient video processing in resource-constrained settings, drawing an interesting parallel to predictive coding. The low overhead and lack of training or auxiliary networks are potential strengths. However, the current lack of visible derivation details, error bars, dataset descriptions, and ablation evidence limits assessment of whether the gains are reliable or generalizable.

major comments (2)
  1. Abstract: The central claim of up to +12.5 point accuracy gains on long-video QA benchmarks with tight frame budgets is presented without derivation details, error bars, dataset descriptions, ablation studies, or statistical tests. This leaves the superiority over uniform sampling and baselines as an unverified assertion rather than a substantiated result.
  2. Method (modeling assumption): The assumption that video features form a locally smooth differentiable trajectory well-approximated by low-order Taylor expansion, such that large residuals reliably mark task-relevant content, is not justified. Typical extractors produce discrete, non-stationary sequences with abrupt jumps (scene cuts, camera motion); if the order or time parameterization is misspecified, selected frames may be outliers or noise rather than informative, directly undermining the performance claims.
minor comments (2)
  1. Abstract: Specify the exact '10 different downstream tasks' and name the 'prior query-agnostic baselines' with citations for reproducibility.
  2. Abstract: Quantify the '30x cheaper overhead' claim with explicit comparisons to named leading baselines and precise overhead measurements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of results and the justification of modeling choices.

  1. Referee: Abstract: The central claim of up to +12.5 point accuracy gains on long-video QA benchmarks with tight frame budgets is presented without derivation details, error bars, dataset descriptions, ablation studies, or statistical tests. This leaves the superiority over uniform sampling and baselines as an unverified assertion rather than a substantiated result.

    Authors: We agree that the abstract would benefit from additional context to support the claims. In the revised manuscript, we have updated the abstract to reference the three benchmarks and note the inclusion of error bars. Full derivation details for the Taylor expansion appear in Section 2, dataset descriptions and ablation studies are expanded in Section 4, and statistical significance tests have been added to the results tables and supplementary material. These changes make the empirical support explicit.

    revision: yes

  2. Referee: Method (modeling assumption): The assumption that video features form a locally smooth differentiable trajectory well-approximated by low-order Taylor expansion, such that large residuals reliably mark task-relevant content, is not justified. Typical extractors produce discrete, non-stationary sequences with abrupt jumps (scene cuts, camera motion); if the order or time parameterization is misspecified, selected frames may be outliers or noise rather than informative, directly undermining the performance claims.

    Authors: We acknowledge that global smoothness does not always hold due to scene cuts and camera motion. Our approach mitigates this by applying low-order Taylor expansions over short local temporal windows, where the residual still highlights deviations from the predicted path. We have added a dedicated paragraph in Section 2.2 justifying the local approximation, analyzing residual behavior at discontinuities, and reporting ablations on expansion order and window size that demonstrate robustness. The consistent gains across 10 downstream tasks indicate that selected frames carry task-relevant information rather than noise.

    revision: yes
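
An ablation over expansion order, as mentioned in the rebuttal, could be prototyped along the following lines. This is a sketch under stated assumptions rather than the paper's procedure: `taylor_forecast` and `surprise_scores` are hypothetical names, the k-th derivative is approximated by the k-th backward difference, and the local window is taken to be the last `order + 1` frames.

```python
import math
import numpy as np

def taylor_forecast(window: np.ndarray, order: int) -> np.ndarray:
    """Forecast the next feature vector from the last `order + 1` rows of window.

    Assumption: k-th backward differences act as k-th derivatives (unit time step).
    """
    w = np.asarray(window, dtype=np.float64)
    if w.shape[0] < order + 1:
        raise ValueError("window must hold at least order + 1 frames")
    pred = np.zeros(w.shape[1])
    diff = w[-(order + 1):]
    for k in range(order + 1):
        pred += diff[-1] / math.factorial(k)  # k-th backward difference at the last frame
        diff = np.diff(diff, axis=0)          # raise the difference order by one
    return pred

def surprise_scores(features: np.ndarray, order: int) -> np.ndarray:
    """Residual norm per frame for a given expansion order (0 where undefined)."""
    x = np.asarray(features, dtype=np.float64)
    scores = np.zeros(x.shape[0])
    for t in range(order, x.shape[0] - 1):
        pred = taylor_forecast(x[t - order:t + 1], order)
        scores[t + 1] = np.linalg.norm(x[t + 1] - pred)
    return scores
```

Order 0 reduces to a plain frame-difference score, and order 1 is linear extrapolation, so sweeping `order` shows how much each extra term changes which frames rank highest.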

Circularity Check

0 steps flagged

No circularity: Taylor-based sampling is a self-contained heuristic


The paper presents Swift Sampling as a training-free heuristic that models video features as a trajectory, computes first- and second-order differences, and uses low-order Taylor expansion to flag large residuals as 'surprises.' No equations or text in the provided manuscript reduce the selection rule to a fitted parameter on the target benchmarks, a self-citation chain, or a definition that presupposes the output. Performance numbers (+12.5 points) are reported as downstream empirical results rather than as a mathematical identity. The derivation therefore remains independent of its own claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that latent-space trajectories are sufficiently smooth for low-order Taylor approximation to be predictive and that deviation from that prediction marks high-information content. No free parameters or invented entities are named in the abstract.

axioms (1)
  • Domain assumption: A video can be modeled as a differentiable trajectory in visual latent space whose short-term evolution is well approximated by Taylor expansion.
    This modeling choice is required for the surprise detection step described in the abstract.
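
The strength of this assumption can be made concrete with the standard Taylor remainder bound. The notation is introduced here and is not from the paper: f is the feature trajectory, assumed three times continuously differentiable on [t, t+h], and h is the time step.

```latex
f(t+h) = f(t) + h\,f'(t) + \tfrac{h^{2}}{2}\,f''(t) + R_2(t,h),
\qquad
\lVert R_2(t,h) \rVert \;\le\; \frac{h^{3}}{6}\,\sup_{s\in[t,\,t+h]} \lVert f'''(s) \rVert .
```

Where the trajectory is smooth, the second-order forecast error is small, so a large residual indicates that the bound's smoothness condition fails locally. With features sampled at discrete frames the derivatives are only estimated, so the bound is a guide rather than a guarantee.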

pith-pipeline@v0.9.0 · 5739 in / 1266 out tokens · 34989 ms · 2026-05-22T05:51:50.994761+00:00 · methodology


