Swift Sampling: Selecting Temporal Surprises via Taylor Series

Bhuvan Sachdeva; Dahye Kim; Deepti Ghadiyaram; Karan Uppal; Naman Gupta; Vineeth N. Balasubramanian

arXiv: 2605.22678 · v1 · pith:JI5HTFUHnew · submitted 2026-05-21 · cs.CV · cs.AI

Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram

Pith reviewed 2026-05-22 05:51 UTC · model grok-4.3

classification: cs.CV · cs.AI

keywords: frame selection · temporal surprise · Taylor expansion · long video understanding · query-agnostic sampling · video QA · predictive coding

The pith

Swift Sampling selects high-information frames by detecting deviations from a Taylor-predicted trajectory in visual latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training-free algorithm that treats a video as a differentiable trajectory of features in latent space. It computes velocity and acceleration, then applies low-order Taylor expansion to forecast the expected path of later frames. Frames whose actual features diverge sharply from this forecast are labeled temporal surprises and retained for sampling. The method adds negligible compute cost yet raises accuracy on long-video question answering and related tasks, with the largest gains when only a small number of frames can be kept.
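The scoring step described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes unit frame spacing, backward finite differences for velocity and acceleration, and an ℓ2 residual. The function name `taylor_residuals` and the toy features are hypothetical.

```python
import math

def taylor_residuals(features):
    """Score each frame by its distance from a second-order Taylor forecast.

    features: list of equal-length lists of floats, one vector per frame.
    Returns one score per frame. The first three frames score 0.0 because
    the forecast needs three preceding frames (an assumption of this sketch).
    """
    scores = [0.0] * len(features)
    for t in range(3, len(features)):
        prev1, prev2, prev3 = features[t - 1], features[t - 2], features[t - 3]
        squared_error = 0.0
        for x, a, b, c in zip(features[t], prev1, prev2, prev3):
            velocity = a - b               # first backward difference
            accel = (a - b) - (b - c)      # second backward difference
            predicted = a + velocity + 0.5 * accel
            squared_error += (x - predicted) ** 2
        scores[t] = math.sqrt(squared_error)
    return scores

# Toy 2-D features: steady drift on the first axis, then a jump on the second.
features = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 5.0]]
scores = taylor_residuals(features)
print(scores)  # [0.0, 0.0, 0.0, 0.0, 5.0]
```

The steady drift is forecast exactly, so only the final frame, which departs from the forecast by 5 units on the second axis, receives a nonzero score.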

Core claim

By modeling video features as a smooth trajectory and using a Taylor expansion to approximate their future evolution, the method identifies frames that deviate strongly from the predicted manifold as temporally surprising and selects them. This yields better downstream performance than uniform or prior query-agnostic sampling, especially under tight frame budgets.

What carries the argument

Low-order Taylor expansion of feature velocity and acceleration to predict subsequent frames and flag large deviations from the expected manifold.
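One way to write this prediction step, assuming unit frame spacing and backward finite differences. The symbols $f_t$ (feature of frame $t$), $v_t$, $a_t$, $\hat{f}_{t+1}$ and $r_{t+1}$ are illustrative notation; this page does not reproduce the paper's own equations.

```latex
\begin{aligned}
v_t &= f_t - f_{t-1}, \\
a_t &= v_t - v_{t-1} = f_t - 2 f_{t-1} + f_{t-2}, \\
\hat{f}_{t+1} &= f_t + v_t + \tfrac{1}{2}\, a_t, \\
r_{t+1} &= \lVert f_{t+1} - \hat{f}_{t+1} \rVert_2 .
\end{aligned}
```

Under this reading, a frame is flagged as a temporal surprise when its residual $r_{t+1}$ is large.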

If this is right

Raises accuracy by up to 12.5 points over uniform sampling on long-video QA benchmarks.
Adds only 0.02 times the baseline compute cost.
Outperforms prior training-free methods across ten downstream tasks without video-specific tuning.
Delivers the largest relative benefit when frame budgets are severely limited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same deviation-from-prediction logic could be tested on audio or text sequences for efficient long-sequence processing.
Higher-order terms in the expansion might capture more complex motions without added training.
Adding task-specific query signals to the surprise score could further improve selection for particular questions.

Load-bearing premise

That large deviations from the low-order Taylor-predicted path in visual latent space correspond to the most informative frames.

What would settle it

Measure whether accuracy gains disappear on videos whose feature trajectories change abruptly and cannot be approximated well by a low-order polynomial.
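A toy version of this check shows what an abrupt change does to the residual. The sketch assumes scalar features, unit frame spacing, and a second-order forecast built from backward differences; `residuals_1d` and both sequences are hypothetical and stand in for real encoder features.

```python
def residuals_1d(xs):
    """Second-order Taylor residual for a scalar feature sequence."""
    out = [0.0] * len(xs)
    for t in range(3, len(xs)):
        a, b, c = xs[t - 1], xs[t - 2], xs[t - 3]
        velocity = a - b
        accel = (a - b) - (b - c)
        out[t] = abs(xs[t] - (a + velocity + 0.5 * accel))
    return out

# Constant-velocity trajectory: perfectly predictable.
smooth = [float(t) for t in range(11)]
# Same trajectory with a single jump of +9 at t = 6 (a stand-in for a scene cut).
with_cut = [float(t) if t < 6 else float(t + 9) for t in range(11)]

smooth_scores = residuals_1d(smooth)
cut_scores = residuals_1d(with_cut)
print(cut_scores[5:10])  # [0.0, 9.0, 13.5, 4.5, 0.0]
```

In this toy setup a single cut raises the score of the jump frame and of the two frames after it, because their forecasts are built from a history that contains the jump. Whether the same effect appears on real features is what the proposed measurement would reveal.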

Figures

Figures reproduced from arXiv: 2605.22678 by Bhuvan Sachdeva, Dahye Kim, Deepti Ghadiyaram, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian.

**Figure 1.** Swift Sampling efficiently identifies temporal surprises in videos by measuring how much a frame deviates from the trajectory predicted by its preceding context. Using a Taylor expansion of visual features, we select frames with the largest residuals within their temporal neighborhood as keyframes. Top: Temporal surprise captured using Taylor residual over time. Bottom: input frames and frames selected by …

**Figure 2.** Each frame is represented on the latent feature trajectory, where we apply Taylor expansion over preceding frames to predict the next frame feature. The residual between the prediction and the actual feature measures how much the trajectory deviates from a smooth continuation. Frames with large residuals correspond to temporal surprises, e.g., seal suddenly emerging from the ice, which Swift Sampling eff…

**Figure 4.** Qualitative comparison of frame selection on a sample video from the Video-MME dataset, given a budget to select 8 frames out of 128. Answering the question requires identifying the temporal order of several visually similar but semantically distinct painting events: establishing the background, drawing the water-lily pads, adding flowers, and increasing texture. Uniform sampling captures the background an…
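The Figure 1 caption describes keeping the frames with the largest residuals within their temporal neighborhood. One plausible reading is a greedy non-maximum suppression over the score sequence, sketched below. The function name, the `radius` parameter, and the greedy rule are assumptions made for illustration; this page does not specify the paper's exact selection procedure.

```python
def select_keyframes(scores, budget, radius):
    """Pick up to `budget` frame indices with the highest scores,
    skipping any frame within `radius` frames of one already chosen."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) == budget:
            break
        if all(abs(i - j) > radius for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # return in temporal order

# Hypothetical residual scores for eight frames.
scores = [0.1, 0.2, 5.0, 4.0, 0.3, 0.1, 3.0, 0.2]
picked = select_keyframes(scores, budget=2, radius=1)
print(picked)  # [2, 6]
```

Frame 3 has the second-highest score but sits next to frame 2, so it is suppressed and the budget goes to frame 6 instead. If suppression removes too many candidates, this sketch returns fewer than `budget` frames.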

Original abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is incredibly lightweight, adding only 0.02x additional computational cost over baseline making it 30x cheaper overhead than leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets improving accuracy by up to +12.5 points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Swift Sampling, a training-free frame selection algorithm for long videos. It models a video as a differentiable trajectory in visual latent space, computes velocity and acceleration of features, applies low-order Taylor expansion to predict the expected path of subsequent frames, and selects frames that diverge sharply from this predicted manifold as temporally surprising high-information frames. The paper claims this outperforms uniform sampling and prior query-agnostic baselines across three long-video question answering benchmarks and 10 downstream tasks, with accuracy gains up to +12.5 points especially under limited frame budgets for long videos, while incurring only 0.02x additional computational cost.

Significance. If the empirical results prove robust and the modeling assumptions hold under scrutiny, the approach could provide a lightweight, query-agnostic alternative for efficient video processing in resource-constrained settings, drawing an interesting parallel to predictive coding. The low overhead and lack of training or auxiliary networks are potential strengths. However, the current lack of visible derivation details, error bars, dataset descriptions, and ablation evidence limits assessment of whether the gains are reliable or generalizable.

major comments (2)

Abstract: The central claim of up to +12.5 point accuracy gains on long-video QA benchmarks with tight frame budgets is presented without derivation details, error bars, dataset descriptions, ablation studies, or statistical tests. This leaves the superiority over uniform sampling and baselines as an unverified assertion rather than a substantiated result.
Method (modeling assumption): The assumption that video features form a locally smooth differentiable trajectory well-approximated by low-order Taylor expansion, such that large residuals reliably mark task-relevant content, is not justified. Typical extractors produce discrete, non-stationary sequences with abrupt jumps (scene cuts, camera motion); if the order or time parameterization is misspecified, selected frames may be outliers or noise rather than informative, directly undermining the performance claims.

minor comments (2)

Abstract: Specify the exact '10 different downstream tasks' and name the 'prior query-agnostic baselines' with citations for reproducibility.
Abstract: Quantify the '30x cheaper overhead' claim with explicit comparisons to named leading baselines and precise overhead measurements.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of results and the justification of modeling choices.

Referee: Abstract: The central claim of up to +12.5 point accuracy gains on long-video QA benchmarks with tight frame budgets is presented without derivation details, error bars, dataset descriptions, ablation studies, or statistical tests. This leaves the superiority over uniform sampling and baselines as an unverified assertion rather than a substantiated result.

Authors: We agree that the abstract would benefit from additional context to support the claims. In the revised manuscript, we have updated the abstract to reference the three benchmarks and note the inclusion of error bars. Full derivation details for the Taylor expansion appear in Section 2, dataset descriptions and ablation studies are expanded in Section 4, and statistical significance tests have been added to the results tables and supplementary material. These changes make the empirical support explicit.
revision: yes

Referee: Method (modeling assumption): The assumption that video features form a locally smooth differentiable trajectory well-approximated by low-order Taylor expansion, such that large residuals reliably mark task-relevant content, is not justified. Typical extractors produce discrete, non-stationary sequences with abrupt jumps (scene cuts, camera motion); if the order or time parameterization is misspecified, selected frames may be outliers or noise rather than informative, directly undermining the performance claims.

Authors: We acknowledge that global smoothness does not always hold due to scene cuts and camera motion. Our approach mitigates this by applying low-order Taylor expansions over short local temporal windows, where the residual still highlights deviations from the predicted path. We have added a dedicated paragraph in Section 2.2 justifying the local approximation, analyzing residual behavior at discontinuities, and reporting ablations on expansion order and window size that demonstrate robustness. The consistent gains across 10 downstream tasks indicate that selected frames carry task-relevant information rather than noise.
revision: yes
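The effect of expansion order can be mimicked on toy data. The sketch below builds a forecast of arbitrary order from backward differences, assuming unit frame spacing and scalar features. `backward_difference` and `taylor_forecast` are hypothetical helper names, not functions from the paper, and the quadratic test signal is made up for illustration.

```python
from math import comb, factorial

def backward_difference(history, k):
    """k-th backward difference evaluated at the most recent sample."""
    return sum((-1) ** j * comb(k, j) * history[-1 - j] for j in range(k + 1))

def taylor_forecast(history, order):
    """Forecast the next sample from the last `order + 1` samples."""
    if len(history) < order + 1:
        raise ValueError("need at least order + 1 samples")
    return sum(backward_difference(history, k) / factorial(k)
               for k in range(order + 1))

history = [0.0, 1.0, 4.0, 9.0]   # samples of t**2 at t = 0..3
actual_next = 16.0               # t**2 at t = 4
errors = {order: abs(actual_next - taylor_forecast(history, order))
          for order in (0, 1, 2)}
print(errors)  # {0: 7.0, 1: 2.0, 2: 1.0}
```

The forecast error shrinks as the order rises, but it does not vanish even for a quadratic signal, because backward differences are lagged estimates of the derivatives. An ablation over order and window size on real features would show how much this matters in practice.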

Circularity Check

0 steps flagged

No circularity: Taylor-based sampling is a self-contained heuristic

The paper presents Swift Sampling as a training-free heuristic that models video features as a trajectory, computes first- and second-order differences, and uses low-order Taylor expansion to flag large residuals as 'surprises.' No equations or text in the provided manuscript reduce the selection rule to a fitted parameter on the target benchmarks, a self-citation chain, or a definition that presupposes the output. Performance numbers (+12.5 points) are reported as downstream empirical results rather than as a mathematical identity. The derivation therefore remains independent of its own claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated premise that latent-space trajectories are sufficiently smooth for low-order Taylor approximation to be predictive and that deviation from that prediction marks high-information content. No free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: A video can be modeled as a differentiable trajectory in visual latent space whose short-term evolution is well approximated by Taylor expansion.
This modeling choice is required for the surprise detection step described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear

"we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames"

IndisputableMonolith/Foundation/AbsoluteFloorClosure · absolute_floor_iff_bare_distinguishability · unclear

"the Taylor residual, the ℓ2 distance between the predicted and the observed feature, serves as a principled, per-frame informativeness score"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 17 internal anchors

[1]

Uni- comp: Rethinking video compression through informational uniqueness.arXiv preprint arXiv:2512.03575, 2025

Chao Yuan, Shimin Chen, Minliang Lin, Limeng Qiao, Guanglu Wan, and Lin Ma. Uni- comp: Rethinking video compression through informational uniqueness.arXiv preprint arXiv:2512.03575, 2025

work page arXiv 2025
[2]

Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.Nature neuroscience, 2(1):79–87, 1999

Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.Nature neuroscience, 2(1):79–87, 1999

work page 1999
[3]

The free-energy principle: a unified brain theory?Nature reviews neuroscience, 11(2):127–138, 2010

Karl Friston. The free-energy principle: a unified brain theory?Nature reviews neuroscience, 11(2):127–138, 2010

work page 2010
[4]

Differential quantization of communication signals, July 29 1952

Cassius C Cutler. Differential quantization of communication signals, July 29 1952. US Patent 2,605,361

work page 1952
[5]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava- video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InECCV, 2020

work page 2020
[9]

Pyscenedetect.https://www.scenedetect.com/

work page
[10]

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Gmflow: Learning optical flow via global matching

Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. Gmflow: Learning optical flow via global matching. InCVPR, 2022

work page 2022
[12]

Flowformer: A transformer architecture for optical flow

Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer: A transformer architecture for optical flow. In ECCV, 2022

work page 2022
[13]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InICCV, 2023

work page 2023
[14]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML, 2022

work page 2022
[15]

Long-clip: Unlocking the long-text capability of clip

Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. InECCV, 2024

work page 2024
[16]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. 2024

work page 2024
[17]

Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

work page 2025
[18]

Chat-univi: Unified visual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024

work page 2024
[19]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Llavanext: Improved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024

work page 2024
[21]

Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos.arXiv preprint arXiv:2408.14023, 2024

Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos.arXiv preprint arXiv:2408.14023, 2024

work page arXiv 2024
[22]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shen- glong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Slowfast-llava-1.5: A family of token-efficient video large language models for long-form video understanding.arXiv preprint arXiv:2503.18943, 2025

Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, and Afshin Dehghan. Slowfast-llava-1.5: A family of token-efficient video large language models for long-form video understanding.arXiv preprint arXiv:2503.18943, 2025

work page arXiv 2025
[25]

Kwai keye-vl technical report.arXiv preprint arXiv:2507.01949, 2025

Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai keye-vl technical report.arXiv preprint arXiv:2507.01949, 2025

work page arXiv 2025
[26]

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Scaling video-language models to 10k frames via hierarchical differential distillation.arXiv preprint arXiv:2504.02438, 2025

Chuanqi Cheng, Jian Guan, Wei Wu, and Rui Yan. Scaling video-language models to 10k frames via hierarchical differential distillation.arXiv preprint arXiv:2504.02438, 2025

work page arXiv 2025
[30]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InCVPR, 2025

work page 2025
[31]

Revisiting the" video" in video-language understanding

Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the" video" in video-language understanding. InCVPR, 2022

work page 2022
[32]

Flexible frame selection for efficient video reasoning

Shyamal Buch, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. Flexible frame selection for efficient video reasoning. InCVPR, 2025

work page 2025
[33]

Frame-voyager: Learning to query frames for video large language models.arXiv preprint arXiv:2410.03226, 2024

Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xiaolei Xu, Zhenbang Sun, Bingni Zhang, Jiawei Wu, et al. Frame-voyager: Learning to query frames for video large language models.arXiv preprint arXiv:2410.03226, 2024

work page arXiv 2024
[34]

M-llm based video frame selection for efficient video understanding

Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, et al. M-llm based video frame selection for efficient video understanding. InCVPR, 2025

work page 2025
[35]

Viarl: Adaptive temporal grounding via visual iterated amplification reinforcement learning.arXiv preprint arXiv:2505.15447, 2025

Ziqiang Xu, Qi Dai, Tian Xie, Yifan Yang, Kai Qiu, DongDong Chen, Zuxuan Wu, and Chong Luo. Viarl: Adaptive temporal grounding via visual iterated amplification reinforcement learning.arXiv preprint arXiv:2505.15447, 2025

work page arXiv 2025
[36]

Refocus: Reinforcement-guided frame optimization for contextual understanding.arXiv preprint arXiv:2506.01274, 2025

Hosu Lee, Junho Kim, Hyunjun Kim, and Yong Man Ro. Refocus: Reinforcement-guided frame optimization for contextual understanding.arXiv preprint arXiv:2506.01274, 2025. 11

work page arXiv 2025
[37]

Self-chained image-language model for video localization and question answering.NeurIPS, 2023

Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Self-chained image-language model for video localization and question answering.NeurIPS, 2023

work page 2023
[38]

Cambrian-s: Towards spatial supersensing in video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis L Brown II, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-s: Towards spatial supersensing in video. InThe Fourteenth International Conference on Learning Representations, 2025

work page 2025
[39]

Generative frame sampler for long video understanding

Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, and Junnan Li. Generative frame sampler for long video understanding. InFindings of the Association for Computational Linguistics: ACL 2025, 2025

work page 2025
[40]

Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

Sara Ghazanfari, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, and Siddharth Garg. Chain-of-frames: Advancing video understanding in multimodal llms via frame-aware reasoning.arXiv preprint arXiv:2506.00318, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Adaptive keyframe sampling for long video understanding

Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. InCVPR, 2025

work page 2025
[42]

From frames to clips: Training-free adaptive key clip selection for long-form video understand- ing.arXiv preprint arXiv:2510.02262, 2025

Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, and Garin Kessler. From frames to clips: Training-free adaptive key clip selection for long-form video understand- ing.arXiv preprint arXiv:2510.02262, 2025

work page arXiv 2025
[43]

Q-frame: Query-aware frame selection and multi-resolution adaptation for video-llms

Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, and Jian Luan. Q-frame: Query-aware frame selection and multi-resolution adaptation for video-llms. InICCV, 2025

work page 2025
[44]

Mdp3: A training-free approach for list-wise frame selection in video-llms

Hui Sun, Shiyin Lu, Huanyu Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Ming Li. Mdp3: A training-free approach for list-wise frame selection in video-llms. InICCV, 2025

work page 2025
[45]

Tem- poral chain of thought: Long-video understanding by thinking in frames.arXiv preprint arXiv:2507.02001, 2025

Anurag Arnab, Ahmet Iscen, Mathilde Caron, Alireza Fathi, and Cordelia Schmid. Tem- poral chain of thought: Long-video understanding by thinking in frames.arXiv preprint arXiv:2507.02001, 2025

work page arXiv 2025
[46]

In NeurIPS

Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, and Shaogang Gong. Cos: Chain-of-shot prompting for long video understanding.arXiv preprint arXiv:2502.06428, 2025

work page arXiv 2025
[47]

Focus: Efficient keyframe selection for long video understanding.arXiv preprint arXiv:2510.27280, 2025

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, and Yang You. Focus: Efficient keyframe selection for long video understanding.arXiv preprint arXiv:2510.27280, 2025

work page arXiv 2025
[48]

Bolt: Boost large vision-language model without training for long-form video understanding

Shuming Liu, Chen Zhao, Tianqi Xu, and Bernard Ghanem. Bolt: Boost large vision-language model without training for long-form video understanding. InCVPR, 2025

work page 2025
[49]

Adard-key: Adaptive relevance-diversity keyframe sampling for long-form video understanding.arXiv preprint arXiv:2510.02778, 2025

Xian Zhang, Zexi Wu, Zinuo Li, Hongming Xu, Luqi Gong, Farid Boussaid, Naoufel Werghi, and Mohammed Bennamoun. Adard-key: Adaptive relevance-diversity keyframe sampling for long-form video understanding.arXiv preprint arXiv:2510.02778, 2025

work page arXiv 2025
[50]

Maxinfo: A training-free key-frame selection method using maximum volume for enhanced video understanding

Pengyi Li, Irina Abdullaeva, Alexander Gambashidze, Andrey Kuznetsov, and Ivan Oseledets. Maxinfo: A training-free key-frame selection method using maximum volume for enhanced video understanding. InWACV, 2026

work page 2026
[51]

Elastictok: Adaptive tokenization for image and video.arXiv preprint arXiv:2410.08368, 2024

Wilson Yan, V olodymyr Mnih, Aleksandra Faust, Matei Zaharia, Pieter Abbeel, and Hao Liu. Elastictok: Adaptive tokenization for image and video.arXiv preprint arXiv:2410.08368, 2024

work page arXiv 2024
[52]

Evatok: Adaptive length video tokenization for efficient visual autoregressive generation.arXiv preprint arXiv:2603.12267, 2026

Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, and Xihui Liu. Evatok: Adaptive length video tokenization for efficient visual autoregressive generation.arXiv preprint arXiv:2603.12267, 2026

work page arXiv 2026
[53]

Learning adaptive and temporally causal video tokenization in a 1d latent space, 2025

Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, and Xue Yang. Learning adaptive and temporally causal video tokenization in a 1d latent space, 2025

work page 2025
[54]

Infotok: Adaptive discrete video tokenizer via information- theoretic compression.arXiv preprint arXiv:2512.16975, 2025

Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, et al. Infotok: Adaptive discrete video tokenizer via information- theoretic compression.arXiv preprint arXiv:2512.16975, 2025. 12

work page arXiv 2025
[55]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[56]

Prunevid: Visual token pruning for efficient video large language models

Xiaohu Huang, Hao Zhou, and Kai Han. Prunevid: Visual token pruning for efficient video large language models. InFindings of the Association for Computational Linguistics: ACL 2025, 2025

work page 2025
[57]

Taylor videos for action recognition

Lei Wang, Xiuyuan Yuan, Tom Gedeon, and Liang Zheng. Taylor videos for action recognition. arXiv preprint arXiv:2402.03019, 2024

work page arXiv 2024
[58]

Siyi Chen, Minkyu Choi, Zesen Zhao, Kuan Han, Qing Qu, and Zhongming Liu. Unfolding videos dynamics via taylor expansion. arXiv preprint arXiv:2409.02371, 2024

[59]

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with taylorseers. In ICCV, 2025

[60]

Hanshuai Cui, Zhiqing Tang, Zhi Yao, Fanshuai Meng, Weijia Jia, and Wei Zhao. Not all frames deserve full computation: Accelerating autoregressive video generation via selective computation and predictive extrapolation. arXiv preprint arXiv:2604.02979, 2026

[61]

Randall J LeVeque. Finite difference methods for ordinary and partial differential equations: steady-state and time-dependent problems. SIAM, 2007

[62]

Approximate taylor methods for odes. Computers & Fluids, 159:156–166, 2017

[63]

Saber Pourheydari, Emad Bahrami, Mohsen Fayyaz, Gianpiero Francesca, Mehdi Noroozi, and Juergen Gall. Taylorswiftnet: Taylor driven temporal modeling for swift future frame prediction. arXiv preprint arXiv:2110.14392, 2021

[64]

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, Hoboken, NJ, 2nd edition, 2006

[65]

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006

[66]

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025

[67]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

[68]

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

[69]

Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. Fastvid: Dynamic density pruning for fast video large language models. arXiv preprint arXiv:2503.11187, 2025

[70]

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In CVPR, 2025

[71]

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In CVPR, 2025

[72]

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. NeurIPS, 2024

[73]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

[74]

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? In Findings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

[75]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024
