LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Pith reviewed 2026-05-17 03:47 UTC · model grok-4.3
The pith
LongVILA scales visual-language models from 8 to 2048 video frames while reaching 99.8 percent accuracy on million-token needle-in-a-haystack retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongVILA upgrades existing VLMs to support long video understanding through two additional training stages, long-context extension and long-video supervised fine-tuning, backed by the long-context Multi-Modal Sequence Parallelism (MM-SP) system that enables 2M-token context training on 256 GPUs without gradient checkpointing. The result extends the supported video frame count from 8 to 2048 and achieves 99.8 percent accuracy on a 6,000-frame video needle-in-a-haystack task containing more than one million tokens.
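For scale, the claim that 6,000 frames exceed one million tokens follows from simple arithmetic once a per-frame visual token count is fixed. That count is not stated in this summary; the sketch below assumes roughly 196 visual tokens per frame, a figure typical of VILA-family projectors, purely for illustration.

```python
# Back-of-envelope context-length check (illustrative; the per-frame token
# count is an assumption, not a figure taken from the paper).
TOKENS_PER_FRAME = 196   # assumed visual tokens emitted per frame
FRAMES = 6_000

visual_tokens = TOKENS_PER_FRAME * FRAMES
print(f"visual tokens for {FRAMES} frames: {visual_tokens:,}")  # ~1,176,000

# Any text prompt, subtitles, or needle/question tokens come on top of this,
# so the full sequence comfortably exceeds one million tokens.
```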
What carries the argument
The two-stage training pipeline of long-context extension followed by long-video supervised fine-tuning, accelerated by Multi-Modal Sequence Parallelism (MM-SP) that parallelizes multi-modal sequences across devices.
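The paper's MM-SP implementation is not reproduced here. The sketch below only illustrates the general pattern behind sequence parallelism for a mixed vision-and-text token stream: split the long sequence into contiguous shards, keep queries local to each shard, and recover full-sequence attention by gathering key/value blocks across shards. All names are hypothetical, and the single-process numpy code stands in for what would be per-GPU tensors and collectives.

```python
# Minimal, single-process illustration of sequence parallelism for a
# multi-modal token stream. This is NOT the paper's MM-SP system; it only
# shows the sharding idea: each "device" holds a contiguous slice of the
# interleaved vision/text sequence, and full-sequence attention is emulated
# by gathering key/value shards from all other shards.
from dataclasses import dataclass

import numpy as np


@dataclass
class Shard:
    rank: int
    tokens: np.ndarray        # (local_len, d_model) hidden states for this slice


def shard_sequence(tokens: np.ndarray, world_size: int) -> list[Shard]:
    """Split one long sequence into contiguous per-rank shards."""
    slices = np.array_split(tokens, world_size, axis=0)
    return [Shard(rank=i, tokens=s) for i, s in enumerate(slices)]


def sharded_attention(shards: list[Shard]) -> np.ndarray:
    """Emulate attention where queries stay local and K/V are exchanged.

    In a real system the inner step is a ring or all-gather of K,V across
    GPUs; here everything lives in one process for clarity.
    """
    keys = np.concatenate([s.tokens for s in shards], axis=0)      # gathered K
    values = keys                                                  # tied K=V for brevity
    outputs = []
    for s in shards:
        scores = s.tokens @ keys.T / np.sqrt(keys.shape[-1])       # local Q x global K
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ values)                           # local output slice
    return np.concatenate(outputs, axis=0)


if __name__ == "__main__":
    seq = np.random.randn(1024, 64).astype(np.float32)  # e.g. interleaved frame+text tokens
    shards = shard_sequence(seq, world_size=8)
    out = sharded_attention(shards)
    print(out.shape)  # (1024, 64): same shape as unsharded attention over the full sequence
```

What MM-SP reportedly adds on top of this pattern is handling long, interleaved multi-modal sequences efficiently enough to reach 2M-token contexts on 256 GPUs; the toy version above exists only to make the sharding idea concrete.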
If this is right
- The 7B LongVILA model reports strong accuracy on nine popular video benchmarks, including 65.1 percent on VideoMME with subtitles.
- MM-SP delivers 2.1x to 5.7x speedups over ring-style sequence parallelism and 1.1x to 1.4x over Megatron hybrid parallelism.
- Training reaches 2 million token context lengths on 256 GPUs without any gradient checkpointing.
- The system integrates directly with Hugging Face Transformers for both training and inference.
Where Pith is reading between the lines
- If MM-SP generalizes cleanly, similar parallelism could be applied to long audio or multi-turn conversation models that also face memory bottlenecks.
- The reported frame scaling suggests that future work could test whether the same stages preserve performance when the model size is increased beyond 7B.
- Real-world video search or summarization pipelines might adopt this approach once inference latency on long inputs is measured end-to-end.
Load-bearing premise
That the two training stages plus MM-SP can be combined without introducing hidden accuracy drops or unstated data-selection effects when moving from short clips to videos with thousands of frames.
What would settle it
Training the same 7B model on a 6,000-frame video set and checking whether needle-in-a-haystack accuracy falls below 95 percent, or whether matching the reported accuracy requires more than 256 GPUs or gradient checkpointing.
read the original abstract
Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack. LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g. 65.1% VideoMME with subtitle. Besides, MM-SP is 2.1x - 5.7x faster than ring style sequence parallelism and 1.1x - 1.4x faster than Megatron with a hybrid context and tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LongVILA as a full-stack solution for scaling visual-language models to long video contexts. It extends VILA via two training stages (long-context extension followed by long-video supervised fine-tuning) and introduces Multi-Modal Sequence Parallelism (MM-SP) to enable efficient training and inference on sequences up to 2M tokens across 256 GPUs without gradient checkpointing. Key claims include extending supported video frames from 8 to 2048, 99.8% accuracy on a 6000-frame (>1M token) video needle-in-a-haystack task, strong results on 9 video benchmarks (e.g., 65.1% on VideoMME with subtitles), and 2.1x–5.7x speedups over ring-style sequence parallelism plus 1.1x–1.4x over Megatron with hybrid parallelism, with seamless Hugging Face Transformers integration.
Significance. If the empirical results hold under scrutiny, the work provides a practical demonstration of scaling VLMs to extreme video lengths while maintaining efficiency, which could accelerate progress in long-video understanding applications. The MM-SP system offers concrete engineering value for parallelizing multimodal training on long contexts, and the reported benchmark gains plus framework compatibility support broader adoption. The concrete accuracy numbers on NIAH and speedups are strengths that, if reproducible, advance both algorithmic and systems aspects of multimodal long-context modeling.
major comments (2)
- [Abstract / Training Methodology] Abstract and training description: the central scaling claim rests on the two-stage process (long context extension + long video SFT) plus MM-SP preserving accuracy and efficiency up to 2048 frames. However, no pre-/post-SFT comparisons on shorter-context or standard video tasks are reported, leaving open the possibility of hidden regressions that would undermine the 'extends from 8 to 2048 frames' claim without performance trade-offs.
- [Experimental Results] Experimental results on NIAH: the 99.8% accuracy on the 6000-frame (>1M token) needle-in-a-haystack is a load-bearing result for the long-context capability. The manuscript provides no details on needle/haystack construction, number of trials, variance across runs, or failure modes, making it difficult to evaluate whether the result generalizes or depends on specific data selection.
minor comments (2)
- [Abstract] The abstract states results on '9 popular video benchmarks' but does not enumerate them; the main text should include an explicit list with per-benchmark scores for clarity.
- [System Evaluation] MM-SP speedup claims (2.1x–5.7x vs. ring style) would benefit from a table specifying exact context lengths, model sizes, and hardware configurations used in the timing experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and commitments to revisions where appropriate.
read point-by-point responses
- Referee: [Abstract / Training Methodology] Abstract and training description: the central scaling claim rests on the two-stage process (long context extension + long video SFT) plus MM-SP preserving accuracy and efficiency up to 2048 frames. However, no pre-/post-SFT comparisons on shorter-context or standard video tasks are reported, leaving open the possibility of hidden regressions that would undermine the 'extends from 8 to 2048 frames' claim without performance trade-offs.
Authors: We agree that explicit pre- and post-SFT comparisons on shorter-context tasks would strengthen the presentation of the scaling claim. The long-context extension stage is intended to preserve short-context capabilities, as indirectly supported by the final model's strong results across the 9 video benchmarks. In the revised manuscript, we will add a new table or section reporting performance of the base VILA, the long-context extended checkpoint, and the final LongVILA on standard short-video tasks to directly demonstrate the absence of regressions. revision: yes
- Referee: [Experimental Results] Experimental results on NIAH: the 99.8% accuracy on the 6000-frame (>1M token) needle-in-a-haystack is a load-bearing result for the long-context capability. The manuscript provides no details on needle/haystack construction, number of trials, variance across runs, or failure modes, making it difficult to evaluate whether the result generalizes or depends on specific data selection.
Authors: We acknowledge that additional methodological details are warranted for reproducibility and to allow readers to assess robustness. The revised manuscript will include an expanded description (in the main text or appendix) of the needle-in-a-haystack construction, including haystack sampling, needle insertion procedure, number of trials, variance or standard deviation across runs, and any observed failure cases. This will clarify that the 99.8% result reflects consistent performance rather than isolated data selection. revision: yes
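As a point of reference for what such a description would cover, the sketch below is a generic video needle-in-a-haystack harness, not the paper's protocol: `answer_question` is a placeholder for the evaluated model, and the sampling, depth, and trial-count choices are illustrative assumptions.

```python
# Generic needle-in-a-haystack (NIAH) harness for long-video evaluation.
# Hypothetical sketch, not the paper's procedure: `answer_question` stands in
# for whatever long-context VLM is being evaluated.
import random
import statistics
from typing import Callable, Sequence


def run_niah(
    haystack_frames: Sequence[str],          # e.g. file paths of distractor frames
    needle_frame: str,                       # the frame containing the target fact
    question: str,
    expected_answer: str,
    answer_question: Callable[[list[str], str], str],
    num_trials: int = 20,
    context_len: int = 6_000,
    seed: int = 0,
) -> dict:
    rng = random.Random(seed)
    scores = []
    for _ in range(num_trials):
        # Sample a haystack and insert the needle at a random depth.
        frames = list(rng.sample(haystack_frames, context_len - 1))
        depth = rng.randrange(context_len)
        frames.insert(depth, needle_frame)

        prediction = answer_question(frames, question)
        scores.append(float(expected_answer.lower() in prediction.lower()))

    return {
        "accuracy": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "num_trials": num_trials,
    }
```

Reporting accuracy together with the spread across trials and needle depths is what would let readers judge whether the 99.8 percent figure reflects consistent retrieval rather than a favorable haystack draw.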
Circularity Check
No circularity: empirical system-building with externally checkable results
full rationale
The paper presents an engineering solution for scaling VILA to long videos via two-stage training (long-context extension then long-video SFT) plus the MM-SP parallelism system. No equations, derivations, or first-principles predictions appear that reduce to fitted parameters or self-referential definitions by construction. Reported outcomes (99.8% on 6000-frame NIAH, 65.1% on VideoMME, 2.1x-5.7x speedups vs. ring SP) are concrete empirical measurements on public benchmarks that can be reproduced or falsified independently. Any self-citations are incidental and not load-bearing for uniqueness theorems or ansatzes; the central claims rest on experimental measurements rather than tautological renaming or fitted-input predictions.
Forward citations
Cited by 19 Pith papers
- VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
- CATS: Curvature Aware Temporal Selection for efficient long video understanding
CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
- VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
- Internalized Reasoning for Long-Context Visual Document Understanding
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
- VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
- Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
- One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
- Small Vision-Language Models are Smart Compressors for Long Video Understanding
Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
- Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
- SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.
- MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- EgoSelf: From Memory to Personalized Egocentric Assistant
EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
- EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...
- Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.