Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Pith reviewed 2026-05-18 18:31 UTC · model grok-4.3
The pith
Aggregating output probabilities from parallel inferences on disjoint video frame subsets improves VideoLLM performance without extra training or longer context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Video Parallel Scaling contracts the Chinchilla scaling law at inference time by processing disjoint frame subsets in parallel streams and aggregating their output probabilities, thereby integrating richer uncorrelated visual evidence and raising performance on video understanding tasks without any additional training or expansion of the context window.
What carries the argument
Aggregation of output probabilities across multiple parallel inference streams, each operating on a unique disjoint subset of the input video frames.
If this is right
- Performance improves consistently on Video-MME and EventHallusion for models from 2B to 32B parameters.
- The method scales more favorably than self-consistency and remains complementary to other decoding strategies.
- Temporal reasoning capabilities of VideoLLMs increase without raising memory usage from longer context windows.
- No retraining is required, so the technique applies directly to existing deployed models.
Where Pith is reading between the lines
- If the uncorrelated-evidence premise holds for other modalities, the same parallel-subset aggregation could be tested on image or audio models that currently hit context limits.
- The approach suggests a general way to trade parallel inference compute for effective data scaling, which could be quantified on controlled synthetic videos where frame correlation is known in advance.
- Memory-efficient inference becomes feasible for hour-long videos by keeping each stream short while still harvesting diverse evidence through aggregation.
Load-bearing premise
The visual evidence obtained from the different disjoint frame subsets is sufficiently uncorrelated that combining their probabilities produces a performance gain equivalent to contracting the Chinchilla scaling law.
What would settle it
Measure the statistical correlation between the per-token output distributions produced by different frame-subset streams; if gains vanish or reverse once measured correlation exceeds a modest threshold, the central claim is falsified.
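One way to run this check is sketched below. It is a minimal sketch that assumes each stream's per-token output distributions are already available as an array; the function name `mean_pairwise_correlation` and the choice of Pearson correlation over flattened distributions are illustrative assumptions, since the page does not say how correlation should be measured.

```python
import numpy as np


def mean_pairwise_correlation(stream_probs):
    """Mean Pearson correlation over all distinct pairs of streams.

    stream_probs: array-like of shape (k, T, V) with k >= 2 streams,
    T token positions and V vocabulary entries (hypothetical layout).
    """
    p = np.asarray(stream_probs, dtype=float)
    k = p.shape[0]
    flat = p.reshape(k, -1)            # one long vector per stream
    corr = np.corrcoef(flat)           # (k, k) correlation matrix
    upper = np.triu_indices(k, k=1)    # indices of distinct pairs only
    return float(corr[upper].mean())
```

A stream whose flattened distribution is constant (for example, exactly uniform everywhere) has zero variance, so its correlations come out as NaN; such streams would need separate handling.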
Original abstract
Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model's perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video's frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Video Parallel Scaling (VPS), an inference-time technique for VideoLLMs. VPS runs multiple parallel inference streams on disjoint subsets of video frames and aggregates their output probabilities to integrate richer visual information without extending the context window or additional training. The authors claim this approach theoretically contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence from the subsets. They report consistent empirical gains across model scales (2B–32B) on benchmarks including Video-MME and EventHallusion, with more favorable scaling than self-consistency and complementarity to other decoding strategies.
Significance. If the claimed contraction of the Chinchilla law holds, the work would offer a practical, memory-efficient route to increasing perceptual bandwidth in VideoLLMs at inference time. The broad experimental coverage across architectures and scales, together with the reported complementarity to existing decoding methods, constitutes a concrete strength. The absence of a first-principles derivation for the scaling-law claim, however, limits the current theoretical contribution.
major comments (2)
- [Abstract and Theoretical Analysis section] The central claim that VPS 'effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence' is load-bearing for the paper's novelty yet lacks any derivation. No steps are shown that start from the Chinchilla form L(N,D) ≈ E + A/N^α + B/D^β, introduce a correlation factor ρ between frame-subset outputs, and demonstrate an effective increase in D (or equivalent loss reduction) with the original exponents.
- [Method section] The aggregation procedure itself (whether logits or probabilities are averaged, how ties or final answer selection is handled, and the precise definition of 'disjoint' subsets) is not specified. This detail is required both to reproduce the reported gains and to evaluate whether the uncorrelated-evidence premise actually holds.
minor comments (2)
- [Experiments] The manuscript would benefit from error bars or statistical significance tests on the Video-MME and EventHallusion results to strengthen the empirical support.
- [Figure 1] Figure 1 or the VPS diagram could more explicitly illustrate the frame-subset partitioning and probability-aggregation step for clarity.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments below and have prepared revisions to the manuscript to incorporate the requested clarifications and expansions.
Point-by-point responses
- Referee: [Abstract and Theoretical Analysis section] The central claim that VPS 'effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence' is load-bearing for the paper's novelty yet lacks any derivation. No steps are shown that start from the Chinchilla form L(N,D) ≈ E + A/N^α + B/D^β, introduce a correlation factor ρ between frame-subset outputs, and demonstrate an effective increase in D (or equivalent loss reduction) with the original exponents.
Authors: We acknowledge that the derivation in the Theoretical Analysis section should be more explicit. The manuscript outlines the conceptual basis for contracting the Chinchilla scaling law through uncorrelated visual evidence, but we agree with the referee that a step-by-step derivation is necessary. In the revised manuscript, we will add the following derivation to the Theoretical Analysis section: starting from the Chinchilla form L(N,D) ≈ E + A/N^α + B/D^β, we model the effective data as D_eff = D * (1 + (1-ρ) * (k-1)), where k is the number of subsets and ρ is the correlation between subset outputs. When ρ is low because the frame subsets are disjoint, D_eff increases, which shrinks the B/D^β term and effectively contracts the scaling curve. We will include the full mathematical steps and assumptions.
revision: yes
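The proposed effective-data model can be evaluated numerically as follows. This is a minimal sketch of the formula quoted in the rebuttal, not the paper's derivation; the function names and the constants in the usage note are placeholders chosen for illustration, not fitted values from any source.

```python
def effective_data(D, k, rho):
    """D_eff = D * (1 + (1 - rho) * (k - 1)), as stated in the rebuttal.

    rho = 1 (fully correlated streams) gives D_eff = D;
    rho = 0 (uncorrelated streams) gives D_eff = k * D.
    """
    return D * (1.0 + (1.0 - rho) * (k - 1))


def chinchilla_loss(N, D, E, A, B, alpha, beta):
    """Chinchilla form L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / N**alpha + B / D**beta
```

For example, with placeholder constants `E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28`, calling `chinchilla_loss(N, effective_data(D, k=4, rho=0.5), ...)` gives a strictly lower loss than `chinchilla_loss(N, D, ...)` for any positive `N` and `D`, because `D_eff = 2.5 * D` shrinks only the `B / D**beta` term.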
- Referee: [Method section] The aggregation procedure itself (whether logits or probabilities are averaged, how ties or final answer selection is handled, and the precise definition of 'disjoint' subsets) is not specified. This detail is required both to reproduce the reported gains and to evaluate whether the uncorrelated-evidence premise actually holds.
Authors: We thank the referee for pointing out this ambiguity. In the revised Method section, we will specify that VPS averages the output probabilities (not logits) from each parallel inference stream. The 'disjoint' subsets are constructed by evenly partitioning the frames into non-overlapping groups. For final answer selection, we take the class or token with the maximum aggregated probability; ties are broken by selecting the first candidate in lexicographical order. We will also add a discussion of how this aggregation relies on the uncorrelated-evidence premise, supported by empirical correlation measurements in the experiments.
revision: yes
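The procedure described in this response can be sketched as below. It is a hedged illustration only: the function names are hypothetical, the interleaved (strided) split is an assumption because the response says only that the groups are even and non-overlapping, and the model call that would produce each stream's probabilities is omitted.

```python
import numpy as np


def partition_frames(num_frames, k):
    """Split frame indices into k disjoint subsets.

    Assumption: an interleaved split, so every stream spans the whole video.
    """
    return [list(range(i, num_frames, k)) for i in range(k)]


def aggregate(stream_probs, labels):
    """Average per-stream probabilities and pick the answer.

    stream_probs: shape (k, C), one distribution over C candidates per stream.
    labels: C candidate strings, used for lexicographic tie-breaking.
    """
    mean_probs = np.mean(np.asarray(stream_probs, dtype=float), axis=0)
    best = mean_probs.max()
    tied = [lab for lab, p in zip(labels, mean_probs) if np.isclose(p, best)]
    return min(tied), mean_probs  # min() = first in lexicographic order
```

With two streams voting `[0.6, 0.4]` and `[0.2, 0.8]` over labels `["A", "B"]`, the averaged distribution is `[0.4, 0.6]`, so the sketch selects `"B"`.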
Circularity Check
No circularity; theoretical claim rests on stated premise without reduction to inputs
full rationale
The paper claims to theoretically show that VPS contracts the Chinchilla scaling law via aggregation of uncorrelated visual evidence from disjoint frame subsets. The abstract presents this as a first-principles benefit supporting performance gains without additional training, but supplies no equations, fitted parameters, or self-citations that reduce the contraction result to the modeling assumption by construction. The uncorrelated-evidence premise is invoked to justify the scaling benefit and is treated as an input rather than derived within the visible text; empirical results on Video-MME and EventHallusion are reported as separate validation. No load-bearing step matches the enumerated circularity patterns, so the derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Outputs from disjoint frame subsets provide uncorrelated visual evidence whose aggregation contracts the Chinchilla scaling law
Reference graph
Works this paper leans on
- [1] URL https://openreview.net/forum?id=eoln5WgrPx. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923,
- [2] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204,
- [3] Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, and Zhongxin Liu. Parallel scaling law for language models. arXiv preprint arXiv:2505.10475,
- [4] URL https://openreview.net/forum?id=6PmJoRfdaK. Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261,
- [5] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. URL https://arxiv.org/abs/2010.11929. Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning,
- [6] Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Framefusion: Combining similarity and importance for video token reduction on large visual language models. arXiv preprint arXiv:2501.01986,
- [7] Hongcheng Gao, Jiashu Qu, Jingyi Tang, Baolong Bi, Yue Liu, Hongyu Chen, Li Liang, Li Su, and Qingming Huang. Exploring hallucination of large multimodal models in video understanding: Benchmark, analysis and mitigation. arXiv preprint arXiv:2503.19622,
- [8] Model card, accessed 25/Jul/2025. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. ...
- [9] URL https://openreview.net/forum?id=kIoBbc76Sy. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276,
- [10] Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, and Minho Shim. Multi-granular spatio-temporal token merging for training-free acceleration of video LLMs. arXiv preprint arXiv:2507.07990,
- [11] ISSN 2835-8856. URL https://openreview.net/forum?id=H4S4ETc8c9. Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720,
- [12] Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, and Kenshi Abe. Regularized best-of-n sampling to mitigate reward hacking for language model alignment. In ICML 2024 Workshop on Models of Human Feedback for AI Alignment,
- [13] Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. Improving LLM Video Understanding with 16 Frames Per Second. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025c. Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. Mor...
- [14] Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. URL https://arxiv.org/abs/2108.12409. Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S Ryoo. Understanding long videos with multimodal language models. In The Thirteenth International Conference on Learning Representations,
- [15] URL https://openreview.net/forum?id=OxKi02I29I. Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, and Lei Li. Explaining context length scaling and bounds for language models. arXiv preprint arXiv:2502.01481,
- [16] RoFormer: Enhanced Transformer with Rotary Position Embedding. URL https://arxiv.org/abs/2104.09864. Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, and Andrea Zanette. Fast Best-of-N Decoding via Speculative Rejection. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,
- [17] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,
- [18] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786,
- [19] Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, and Mohamed Elhoseiny. Time blindness: Why video-language models can't see what humans can? arXiv preprint arXiv:2505.24867,
- [20] Peiqi Wang, ShengYun Peng, Xuewen Zhang, Hanchao Yu, Yibo Yang, Lifu Huang, Fujun Liu, and Qifan Wang. Inference compute-optimal video vision language models. arXiv preprint arXiv:2505.18855, 2025a. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought r...
- [21] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724,
- [22] Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (...
- [23] Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. Slowfast-llava: A strong training-free baseline for video large language models. arXiv preprint arXiv:2407.15841,
- [24] Jiacheng Zhang, Yang Jiao, Shaoxiang Chen, Na Zhao, Zhiyu Tan, Hao Li, and Jingjing Chen. EventHallusion: Diagnosing event hallucinations in videoLLMs. arXiv preprint arXiv:2409.16597,
- [25] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479,
- [26] A PROOFS. In Chen et al. (2025), the inputs to the parallel streams are learnable transformations of the same input x, so one can assume that each parallel stream follows (7) in an unbiased way, which simplifies the analysis of the parallel scaling law. We start by reviewing the result from Chen et al. (2025). Lemma 1 (Chen et ...
- [27] $+\,O(\Delta^3)$ (41). B FURTHER RESULTS. Probability and logit averaging. In Tab. 6, we compare the results of logit averaging and probability averaging when implementing VPS. Across different model classes, we find that both approaches lead to similar results. Thus, while we assume probability averaging in the theoretical analysis for simplicity, we resort to logit av...
- [28] when evaluating the free form descriptions of the video, we follow Zhang et al. (2024) and use the prompt specified in Tab
- [29] C.4 INCORPORATING OTHER STRATEGIES. For TCD, we construct a negative stream in which half of the frames are zeroed out in an interleaved fashion. Let $x'$ be the frame-dropped version of the sub-sampled video. Then, TCD is implemented with $\tilde{p}_\theta(y \mid x) = (1 + \alpha)\, p_\theta(y \mid x) - \alpha\, p_\theta(y \mid x')$ (42), where $\alpha \in [0, 1)$ is a constant. Additionally, we set a hyperparameter $\beta \in [0, 1]$ tha...
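The TCD combination rule quoted in entry [29] (Eq. 42) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, zeroing the odd-indexed frames is one possible reading of "interleaved", and the clipping and renormalization step is an added assumption because Eq. 42 on its own can yield negative, unnormalized scores. The truncated hyperparameter β is not modeled.

```python
import numpy as np


def interleaved_drop(frames):
    """Return a copy of `frames` with every other frame zeroed out.

    Assumption: odd-indexed frames are dropped; axis 0 is the frame axis.
    """
    out = np.array(frames, dtype=float, copy=True)
    out[1::2] = 0.0
    return out


def tcd_combine(p_pos, p_neg, alpha):
    """Eq. 42: (1 + alpha) * p(y|x) - alpha * p(y|x'), with 0 <= alpha < 1."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    scores = (1.0 + alpha) * p_pos - alpha * p_neg
    # Assumption (not in the source): clip negatives and renormalize.
    scores = np.clip(scores, 0.0, None)
    total = scores.sum()
    return scores / total if total > 0 else p_pos
```

With `alpha = 0` the rule reduces to the original distribution, so the negative stream has no effect.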