LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Pith reviewed 2026-05-17 03:47 UTC · model grok-4.3
The pith
LongVILA scales visual-language models from 8 to 2048 video frames while reaching 99.8 percent accuracy on million-token needle-in-a-haystack retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongVILA upgrades existing VLMs to support long video understanding through two additional training stages, long-context extension and long-video supervised fine-tuning, backed by the long-context Multi-Modal Sequence Parallelism (MM-SP) system that enables 2M-token context training on 256 GPUs without gradient checkpointing. The result extends the supported video frame count from 8 to 2048 and achieves 99.8 percent accuracy on a 6,000-frame video needle-in-a-haystack task containing more than one million tokens.
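For scale, the claim that 6,000 frames exceed one million tokens follows from simple arithmetic once a per-frame visual token count is fixed. That count is not stated in this summary; the sketch below assumes roughly 196 visual tokens per frame, a figure typical of VILA-family projectors, purely for illustration.

```python
# Back-of-envelope context-length check (illustrative; the per-frame token
# count is an assumption, not a figure taken from the paper).
TOKENS_PER_FRAME = 196   # assumed visual tokens emitted per frame
FRAMES = 6_000

visual_tokens = TOKENS_PER_FRAME * FRAMES
print(f"visual tokens for {FRAMES} frames: {visual_tokens:,}")  # ~1,176,000

# Any text prompt, subtitles, or needle/question tokens come on top of this,
# so the full sequence comfortably exceeds one million tokens.
```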
What carries the argument
The two-stage training pipeline of long-context extension followed by long-video supervised fine-tuning, accelerated by Multi-Modal Sequence Parallelism (MM-SP) that parallelizes multi-modal sequences across devices.
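The paper's MM-SP implementation is not reproduced here. The sketch below only illustrates the general pattern behind sequence parallelism for a mixed vision-and-text token stream: split the long sequence into contiguous shards, keep queries local to each shard, and recover full-sequence attention by gathering key/value blocks across shards. All names are hypothetical, and the single-process numpy code stands in for what would be per-GPU tensors and collectives.

```python
# Minimal, single-process illustration of sequence parallelism for a
# multi-modal token stream. This is NOT the paper's MM-SP system; it only
# shows the sharding idea: each "device" holds a contiguous slice of the
# interleaved vision/text sequence, and full-sequence attention is emulated
# by gathering key/value shards from all other shards.
from dataclasses import dataclass

import numpy as np


@dataclass
class Shard:
    rank: int
    tokens: np.ndarray        # (local_len, d_model) hidden states for this slice


def shard_sequence(tokens: np.ndarray, world_size: int) -> list[Shard]:
    """Split one long sequence into contiguous per-rank shards."""
    slices = np.array_split(tokens, world_size, axis=0)
    return [Shard(rank=i, tokens=s) for i, s in enumerate(slices)]


def sharded_attention(shards: list[Shard]) -> np.ndarray:
    """Emulate attention where queries stay local and K/V are exchanged.

    In a real system the inner step is a ring or all-gather of K,V across
    GPUs; here everything lives in one process for clarity.
    """
    keys = np.concatenate([s.tokens for s in shards], axis=0)      # gathered K
    values = keys                                                  # tied K=V for brevity
    outputs = []
    for s in shards:
        scores = s.tokens @ keys.T / np.sqrt(keys.shape[-1])       # local Q x global K
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ values)                           # local output slice
    return np.concatenate(outputs, axis=0)


if __name__ == "__main__":
    seq = np.random.randn(1024, 64).astype(np.float32)  # e.g. interleaved frame+text tokens
    shards = shard_sequence(seq, world_size=8)
    out = sharded_attention(shards)
    print(out.shape)  # (1024, 64): same shape as unsharded attention over the full sequence
```

What MM-SP reportedly adds on top of this pattern is handling long, interleaved multi-modal sequences efficiently enough to reach 2M-token contexts on 256 GPUs; the toy version above exists only to make the sharding idea concrete.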
If this is right
- The 7B LongVILA model reports strong accuracy on nine popular video benchmarks, including 65.1 percent on VideoMME with subtitles.
- MM-SP delivers 2.1x to 5.7x speedups over ring-style sequence parallelism and 1.1x to 1.4x over Megatron hybrid parallelism.
- Training reaches 2 million token context lengths on 256 GPUs without any gradient checkpointing.
- The system integrates directly with Hugging Face Transformers for both training and inference.
Where Pith is reading between the lines
- If MM-SP generalizes cleanly, similar parallelism could be applied to long audio or multi-turn conversation models that also face memory bottlenecks.
- The reported frame scaling suggests that future work could test whether the same stages preserve performance when the model size is increased beyond 7B.
- Real-world video search or summarization pipelines might adopt this approach once inference latency on long inputs is measured end-to-end.
Load-bearing premise
That the two training stages plus MM-SP can be combined without introducing hidden accuracy drops or unstated data-selection effects when moving from short clips to videos with thousands of frames.
What would settle it
Training the same 7B model on a 6,000-frame video set and checking whether needle-in-a-haystack accuracy falls below 95 percent, or whether matching the reported accuracy requires more than 256 GPUs or gradient checkpointing.
read the original abstract
Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack. LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g. 65.1% VideoMME with subtitle. Besides, MM-SP is 2.1x - 5.7x faster than ring style sequence parallelism and 1.1x - 1.4x faster than Megatron with a hybrid context and tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LongVILA as a full-stack solution for scaling visual-language models to long video contexts. It extends VILA via two training stages (long-context extension followed by long-video supervised fine-tuning) and introduces Multi-Modal Sequence Parallelism (MM-SP) to enable efficient training and inference on sequences up to 2M tokens across 256 GPUs without gradient checkpointing. Key claims include extending supported video frames from 8 to 2048, 99.8% accuracy on a 6000-frame (>1M token) video needle-in-a-haystack task, strong results on 9 video benchmarks (e.g., 65.1% on VideoMME with subtitles), and 2.1x–5.7x speedups over ring-style sequence parallelism plus 1.1x–1.4x over Megatron with hybrid parallelism, with seamless Hugging Face Transformers integration.
Significance. If the empirical results hold under scrutiny, the work provides a practical demonstration of scaling VLMs to extreme video lengths while maintaining efficiency, which could accelerate progress in long-video understanding applications. The MM-SP system offers concrete engineering value for parallelizing multimodal training on long contexts, and the reported benchmark gains plus framework compatibility support broader adoption. The concrete accuracy numbers on NIAH and speedups are strengths that, if reproducible, advance both algorithmic and systems aspects of multimodal long-context modeling.
major comments (2)
- [Abstract / Training Methodology] Abstract and training description: the central scaling claim rests on the two-stage process (long context extension + long video SFT) plus MM-SP preserving accuracy and efficiency up to 2048 frames. However, no pre-/post-SFT comparisons on shorter-context or standard video tasks are reported, leaving open the possibility of hidden regressions that would undermine the 'extends from 8 to 2048 frames' claim without performance trade-offs.
- [Experimental Results] Experimental results on NIAH: the 99.8% accuracy on the 6000-frame (>1M token) needle-in-a-haystack is a load-bearing result for the long-context capability. The manuscript provides no details on needle/haystack construction, number of trials, variance across runs, or failure modes, making it difficult to evaluate whether the result generalizes or depends on specific data selection.
minor comments (2)
- [Abstract] The abstract states results on '9 popular video benchmarks' but does not enumerate them; the main text should include an explicit list with per-benchmark scores for clarity.
- [System Evaluation] MM-SP speedup claims (2.1x–5.7x vs. ring style) would benefit from a table specifying exact context lengths, model sizes, and hardware configurations used in the timing experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and commitments to revisions where appropriate.
read point-by-point responses
- Referee: [Abstract / Training Methodology] Abstract and training description: the central scaling claim rests on the two-stage process (long context extension + long video SFT) plus MM-SP preserving accuracy and efficiency up to 2048 frames. However, no pre-/post-SFT comparisons on shorter-context or standard video tasks are reported, leaving open the possibility of hidden regressions that would undermine the 'extends from 8 to 2048 frames' claim without performance trade-offs.
Authors: We agree that explicit pre- and post-SFT comparisons on shorter-context tasks would strengthen the presentation of the scaling claim. The long-context extension stage is intended to preserve short-context capabilities, as indirectly supported by the final model's strong results across the 9 video benchmarks. In the revised manuscript, we will add a new table or section reporting performance of the base VILA, the long-context extended checkpoint, and the final LongVILA on standard short-video tasks to directly demonstrate the absence of regressions. revision: yes
- Referee: [Experimental Results] Experimental results on NIAH: the 99.8% accuracy on the 6000-frame (>1M token) needle-in-a-haystack is a load-bearing result for the long-context capability. The manuscript provides no details on needle/haystack construction, number of trials, variance across runs, or failure modes, making it difficult to evaluate whether the result generalizes or depends on specific data selection.
Authors: We acknowledge that additional methodological details are warranted for reproducibility and to allow readers to assess robustness. The revised manuscript will include an expanded description (in the main text or appendix) of the needle-in-a-haystack construction, including haystack sampling, needle insertion procedure, number of trials, variance or standard deviation across runs, and any observed failure cases. This will clarify that the 99.8% result reflects consistent performance rather than isolated data selection. revision: yes
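As a point of reference for what such a description would cover, the sketch below is a generic video needle-in-a-haystack harness, not the paper's protocol: `answer_question` is a placeholder for the evaluated model, and the sampling, depth, and trial-count choices are illustrative assumptions.

```python
# Generic needle-in-a-haystack (NIAH) harness for long-video evaluation.
# Hypothetical sketch, not the paper's procedure: `answer_question` stands in
# for whatever long-context VLM is being evaluated.
import random
import statistics
from typing import Callable, Sequence


def run_niah(
    haystack_frames: Sequence[str],          # e.g. file paths of distractor frames
    needle_frame: str,                       # the frame containing the target fact
    question: str,
    expected_answer: str,
    answer_question: Callable[[list[str], str], str],
    num_trials: int = 20,
    context_len: int = 6_000,
    seed: int = 0,
) -> dict:
    rng = random.Random(seed)
    scores = []
    for _ in range(num_trials):
        # Sample a haystack and insert the needle at a random depth.
        frames = list(rng.sample(haystack_frames, context_len - 1))
        depth = rng.randrange(context_len)
        frames.insert(depth, needle_frame)

        prediction = answer_question(frames, question)
        scores.append(float(expected_answer.lower() in prediction.lower()))

    return {
        "accuracy": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "num_trials": num_trials,
    }
```

Reporting accuracy together with the spread across trials and needle depths is what would let readers judge whether the 99.8 percent figure reflects consistent retrieval rather than a favorable haystack draw.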
Circularity Check
No circularity: empirical system-building with externally checkable results
full rationale
The paper presents an engineering solution for scaling VILA to long videos via two-stage training (long-context extension then long-video SFT) plus the MM-SP parallelism system. No equations, derivations, or first-principles predictions appear that reduce to fitted parameters or self-referential definitions by construction. Reported outcomes (99.8% on 6000-frame NIAH, 65.1% on VideoMME, 2.1x-5.7x speedups vs. ring SP) are concrete empirical measurements on public benchmarks that can be reproduced or falsified independently. Any self-citations are incidental and not load-bearing for uniqueness theorems or ansatzes; the central claims rest on experimental measurements rather than tautological renaming or fitted-input predictions.
Forward citations
Cited by 19 Pith papers
- VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
- CATS: Curvature Aware Temporal Selection for efficient long video understanding
CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
- VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
- Internalized Reasoning for Long-Context Visual Document Understanding
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
- VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
- Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
- One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
- Small Vision-Language Models are Smart Compressors for Long Video Understanding
Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
- Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
- SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.
- MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- EgoSelf: From Memory to Personalized Egocentric Assistant
EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
- EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...
- Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.