Swift Sampling: Selecting Temporal Surprises via Taylor Series
Pith reviewed 2026-05-22 05:51 UTC · model grok-4.3
The pith
Swift Sampling selects high-information frames by detecting deviations from a Taylor-predicted trajectory in visual latent space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper models video features as a smooth trajectory and uses a Taylor expansion to approximate their future evolution. Frames that deviate strongly from the predicted manifold are identified as temporally surprising and selected. The paper reports that this yields better downstream performance than uniform or prior query-agnostic sampling, especially under tight frame budgets.
What carries the argument
A low-order Taylor expansion, built from feature velocity and acceleration, that predicts subsequent frames and flags large deviations from the expected manifold.
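To make the mechanism concrete, here is a minimal Python sketch of a Taylor-residual surprise score. It is not the authors' implementation: the function names `taylor_surprise` and `select_frames`, the unit time step, the backward finite differences, the second-order truncation, and the plain top-k selection are all assumptions made for illustration.

```python
import numpy as np

def taylor_surprise(features: np.ndarray) -> np.ndarray:
    """Per-frame residual against a second-order Taylor prediction.

    features: array of shape (T, D), one feature vector per frame.
    Returns an array of shape (T,). The first three entries are 0 because
    a prediction needs three preceding frames.
    """
    feats = np.asarray(features, dtype=np.float64)
    n_frames = feats.shape[0]
    scores = np.zeros(n_frames)
    if n_frames < 4:
        return scores
    # Assumption: backward finite differences with unit time step.
    velocity = feats[2:-1] - feats[1:-2]                  # v_t = f_t - f_{t-1}
    accel = feats[2:-1] - 2.0 * feats[1:-2] + feats[:-3]  # a_t = f_t - 2 f_{t-1} + f_{t-2}
    predicted = feats[2:-1] + velocity + 0.5 * accel      # estimate of f_{t+1}
    # Surprise is the l2 distance between observed and predicted features.
    scores[3:] = np.linalg.norm(feats[3:] - predicted, axis=1)
    return scores

def select_frames(features: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` highest-scoring frames, in time order."""
    scores = taylor_surprise(features)
    k = max(0, min(budget, scores.shape[0]))
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)

if __name__ == "__main__":
    # Toy trajectory: constant-velocity motion with one jump at frame 10.
    feats = np.outer(np.arange(20, dtype=float), np.ones(4))
    feats[10:] += 5.0
    print(select_frames(feats, budget=3))
```

In this toy trajectory, frames 10, 11, and 12 all receive nonzero scores, because the backward differences keep straddling the jump for two frames after it. A practical selector would probably need some handling of such echoes around cuts; whether and how the paper does this is not stated on this page.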
If this is right
- Raises accuracy by up to 12.5 points over uniform sampling on long-video QA benchmarks.
- Adds only 0.02 times the baseline compute cost.
- Outperforms prior training-free methods across ten downstream tasks without video-specific tuning.
- Delivers the largest relative benefit when frame budgets are severely limited.
Where Pith is reading between the lines
- The same deviation-from-prediction logic could be tested on audio or text sequences for efficient long-sequence processing.
- Higher-order terms in the expansion might capture more complex motions without added training.
- Adding task-specific query signals to the surprise score could further improve selection for particular questions.
Load-bearing premise
That large deviations from the low-order Taylor-predicted path in visual latent space correspond to the most informative frames.
What would settle it
Measure whether accuracy gains disappear on videos whose feature trajectories change abruptly and cannot be approximated well by a low-order polynomial.
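One way to operationalize "cannot be approximated well by a low-order polynomial" is to compare the second-order extrapolation error with the error of simply repeating the previous frame. The sketch below is a hypothetical diagnostic, not something taken from the paper: the name `prediction_gain`, the unit time step, and the choice of the hold-last-frame baseline are assumptions.

```python
import numpy as np

def prediction_gain(features: np.ndarray) -> float:
    """Ratio of mean second-order residual to mean hold-last-frame residual.

    Values well below 1 suggest the low-order extrapolation helps on this
    video; values near or above 1 suggest the trajectory is too abrupt for it.
    """
    f = np.asarray(features, dtype=np.float64)
    if f.shape[0] < 4:
        raise ValueError("need at least 4 frames")
    # Second-order extrapolation from three past frames (unit time step):
    # f_t + (f_t - f_{t-1}) + 0.5 * (f_t - 2 f_{t-1} + f_{t-2})
    taylor_pred = 2.5 * f[2:-1] - 2.0 * f[1:-2] + 0.5 * f[:-3]
    taylor_err = np.linalg.norm(f[3:] - taylor_pred, axis=1).mean()
    # Baseline: predict that the next frame equals the current one.
    hold_err = np.linalg.norm(f[3:] - f[2:-1], axis=1).mean()
    if hold_err == 0.0:
        return 0.0 if taylor_err == 0.0 else float("inf")
    return float(taylor_err / hold_err)

if __name__ == "__main__":
    smooth = np.outer(np.arange(50, dtype=float), np.ones(8))
    noisy = np.random.default_rng(0).standard_normal((500, 16))
    print(prediction_gain(smooth), prediction_gain(noisy))
```

Videos could then be binned by this ratio, and the accuracy gain over uniform sampling compared across bins. If the gain persists in the high-ratio bin, the smoothness premise is less load-bearing than it appears.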
Original abstract
While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is incredibly lightweight, adding only 0.02x additional computational cost over baseline making it 30x cheaper overhead than leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets improving accuracy by up to +12.5 points.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Swift Sampling, a training-free frame selection algorithm for long videos. It models a video as a differentiable trajectory in visual latent space, computes velocity and acceleration of features, applies low-order Taylor expansion to predict the expected path of subsequent frames, and selects frames that diverge sharply from this predicted manifold as temporally surprising high-information frames. The paper claims this outperforms uniform sampling and prior query-agnostic baselines across three long-video question answering benchmarks and 10 downstream tasks, with accuracy gains up to +12.5 points especially under limited frame budgets for long videos, while incurring only 0.02x additional computational cost.
Significance. If the empirical results prove robust and the modeling assumptions hold under scrutiny, the approach could provide a lightweight, query-agnostic alternative for efficient video processing in resource-constrained settings, drawing an interesting parallel to predictive coding. The low overhead and lack of training or auxiliary networks are potential strengths. However, the current lack of visible derivation details, error bars, dataset descriptions, and ablation evidence limits assessment of whether the gains are reliable or generalizable.
major comments (2)
- Abstract: The central claim of up to +12.5 point accuracy gains on long-video QA benchmarks with tight frame budgets is presented without derivation details, error bars, dataset descriptions, ablation studies, or statistical tests. This leaves the superiority over uniform sampling and baselines as an unverified assertion rather than a substantiated result.
- Method (modeling assumption): The assumption that video features form a locally smooth differentiable trajectory well-approximated by low-order Taylor expansion, such that large residuals reliably mark task-relevant content, is not justified. Typical extractors produce discrete, non-stationary sequences with abrupt jumps (scene cuts, camera motion); if the order or time parameterization is misspecified, selected frames may be outliers or noise rather than informative, directly undermining the performance claims.
minor comments (2)
- Abstract: Specify the exact '10 different downstream tasks' and name the 'prior query-agnostic baselines' with citations for reproducibility.
- Abstract: Quantify the '30x cheaper overhead' claim with explicit comparisons to named leading baselines and precise overhead measurements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of results and the justification of modeling choices.
- Referee: Abstract: The central claim of up to +12.5 point accuracy gains on long-video QA benchmarks with tight frame budgets is presented without derivation details, error bars, dataset descriptions, ablation studies, or statistical tests. This leaves the superiority over uniform sampling and baselines as an unverified assertion rather than a substantiated result.
  Authors: We agree that the abstract would benefit from additional context to support the claims. In the revised manuscript, we have updated the abstract to reference the three benchmarks and note the inclusion of error bars. Full derivation details for the Taylor expansion appear in Section 2, dataset descriptions and ablation studies are expanded in Section 4, and statistical significance tests have been added to the results tables and supplementary material. These changes make the empirical support explicit.
  revision: yes
- Referee: Method (modeling assumption): The assumption that video features form a locally smooth differentiable trajectory well-approximated by low-order Taylor expansion, such that large residuals reliably mark task-relevant content, is not justified. Typical extractors produce discrete, non-stationary sequences with abrupt jumps (scene cuts, camera motion); if the order or time parameterization is misspecified, selected frames may be outliers or noise rather than informative, directly undermining the performance claims.
  Authors: We acknowledge that global smoothness does not always hold due to scene cuts and camera motion. Our approach mitigates this by applying low-order Taylor expansions over short local temporal windows, where the residual still highlights deviations from the predicted path. We have added a dedicated paragraph in Section 2.2 justifying the local approximation, analyzing residual behavior at discontinuities, and reporting ablations on expansion order and window size that demonstrate robustness. The consistent gains across 10 downstream tasks indicate that selected frames carry task-relevant information rather than noise.
  revision: yes
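The rebuttal names two knobs, expansion order and window size, without showing how they enter the score. The sketch below illustrates one way such knobs could be exposed. It swaps in a least-squares polynomial fit over a sliding window followed by one-step extrapolation, which is not necessarily what the authors do. The function name `windowed_surprise` and its default values are hypothetical.

```python
import numpy as np

def windowed_surprise(features: np.ndarray, window: int = 5, order: int = 2) -> np.ndarray:
    """Residual of each frame against a local polynomial extrapolation.

    For every frame t >= window, fit a degree-`order` polynomial (per feature
    dimension) to the previous `window` frames and extrapolate one step ahead.
    """
    if window <= order:
        raise ValueError("window must be larger than order")
    f = np.asarray(features, dtype=np.float64)
    n_frames = f.shape[0]
    scores = np.zeros(n_frames)
    # The time grid is the same for every window, so the extrapolation is a
    # fixed linear combination of the window's frames; compute it once.
    grid = np.vander(np.arange(window, dtype=np.float64), order + 1)
    target = np.vander(np.array([float(window)]), order + 1)
    weights = (target @ np.linalg.pinv(grid)).ravel()   # shape (window,)
    for t in range(window, n_frames):
        predicted = weights @ f[t - window:t]           # shape (D,)
        scores[t] = np.linalg.norm(f[t] - predicted)
    return scores

if __name__ == "__main__":
    t = np.arange(30, dtype=float)
    quad = np.stack([t ** 2, 3 * t - 1], axis=1)
    # A quadratic path is fit exactly at order 2 but not at order 1.
    print(windowed_surprise(quad, order=2).max())
    print(windowed_surprise(quad, order=1).max())
```

An ablation over `window` and `order` in this style would show how sensitive the selected frames are to the two choices the referee flags as potentially misspecified.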
Circularity Check
No circularity: Taylor-based sampling is a self-contained heuristic
The paper presents Swift Sampling as a training-free heuristic that models video features as a trajectory, computes first- and second-order differences, and uses low-order Taylor expansion to flag large residuals as 'surprises.' No equations or text in the provided manuscript reduce the selection rule to a fitted parameter on the target benchmarks, a self-citation chain, or a definition that presupposes the output. Performance numbers (+12.5 points) are reported as downstream empirical results rather than as a mathematical identity. The derivation therefore remains independent of its own claims.
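For concreteness, the rule described in this rationale can be written as follows. This is a plausible reading under assumed unit frame spacing and backward differences; the manuscript's own notation is not reproduced on this page, so the symbols $f_t$, $v_t$, $a_t$, $\hat{f}_{t+1}$, and $s_{t+1}$ are illustrative.

```latex
\begin{aligned}
v_t &= f_t - f_{t-1}, \qquad a_t = f_t - 2 f_{t-1} + f_{t-2}, \\
\hat{f}_{t+1} &= f_t + v_t + \tfrac{1}{2}\, a_t, \\
s_{t+1} &= \bigl\lVert f_{t+1} - \hat{f}_{t+1} \bigr\rVert_2 .
\end{aligned}
```

Under this reading, the score $s_{t+1}$ depends only on the frame features and contains no quantity fitted to benchmark labels, which is consistent with the no-circularity finding.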
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A video can be modeled as a differentiable trajectory in visual latent space whose short-term evolution is well approximated by Taylor expansion.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure absolute_floor_iff_bare_distinguishability (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "the Taylor residual – the ℓ2 distance between the predicted and the observed feature – serves as a principled, per-frame informativeness score"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.