NVILA: Efficient Frontier Visual Language Models
Pith reviewed 2026-05-23 07:40 UTC · model grok-4.3
The pith
NVILA matches or exceeds leading VLM accuracy on image and video tasks while cutting training costs 1.9-5.1x and latencies 1.2-2.8x.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NVILA improves VILA by first scaling up spatial and temporal resolutions and then compressing visual tokens, enabling efficient handling of high-resolution images and long videos; when combined with systematic lifecycle optimizations, this produces models that match or surpass the accuracy of leading open and proprietary VLMs across image and video benchmarks while reducing training cost by 1.9-5.1x, prefilling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x.
What carries the argument
Scale-then-compress procedure that increases input resolutions before reducing the count of visual tokens.
If this is right
- High-resolution images and long videos become practical inputs without linear growth in compute.
- Efficiency gains apply from initial training through final deployment.
- Accuracy holds across a wide range of existing image and video benchmarks.
- Open release of models and code supports direct verification and reuse.
Where Pith is reading between the lines
- The same scale-then-compress pattern could be tested on other multimodal architectures that process sequences of tokens.
- Further work could measure whether the efficiency gains persist when models are scaled to larger sizes.
- Direct comparison on robustness metrics outside the reported benchmarks would clarify the limits of information preservation.
Load-bearing premise
The scale-then-compress steps and lifecycle changes preserve every piece of task-relevant visual information on the tested benchmarks.
What would settle it
A new benchmark or out-of-distribution test where NVILA shows clear accuracy drops traceable to lost fine-grained visual details after compression would falsify the central claim.
read the original abstract
Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimize efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We further conduct a systematic investigation that enhances NVILA's efficiency throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training cost by 1.9-5.1x, prefilling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x. We release our code and models to facilitate reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NVILA, a family of open VLMs built on VILA that first scales spatial and temporal resolutions then compresses visual tokens (the 'scale-then-compress' approach) to efficiently handle high-resolution images and long videos. It further optimizes efficiency across the full lifecycle from training to deployment. The central claim is that NVILA matches or surpasses leading open and proprietary VLMs in accuracy across a wide range of image and video benchmarks while reducing training cost by 1.9-5.1x, prefilling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x; code and models are released for reproducibility.
Significance. If the accuracy claims hold under rigorous validation, the work would be significant for the VLM field by showing that high-resolution and long-video processing can be achieved with substantial efficiency gains at both training and inference time, while the open release of code and models directly supports reproducibility and downstream research.
major comments (2)
- [Abstract] Abstract: the abstract states performance numbers but supplies no experimental details, baseline definitions, statistical tests, or ablation results, so it is impossible to judge whether the data actually support the central claim of accuracy parity alongside the reported efficiency gains.
- [Scale-then-compress procedure] Scale-then-compress procedure (described in the abstract and presumably §3): the headline claim requires that scaling resolution then compressing tokens preserves all task-relevant visual information on the chosen benchmarks; this is least secure because standard VLM benchmarks often tolerate moderate information loss, and the manuscript would need to demonstrate that the chosen compression does not degrade performance on finer-grained or distribution-shifted cases.
minor comments (1)
- The abstract could be expanded with at least one sentence on model sizes, exact benchmark suites, and the precise compression operator used after scaling.
Simulated Author's Rebuttal
We thank the referee for the thoughtful comments. Below we respond point-by-point to the two major concerns. We believe the manuscript already supplies the requested evidence in the main sections, but we are prepared to add clarifications or additional discussion where it strengthens the paper without altering its core claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the abstract states performance numbers but supplies no experimental details, baseline definitions, statistical tests, or ablation results, so it is impossible to judge whether the data actually support the central claim of accuracy parity alongside the reported efficiency gains.
Authors: Abstracts are length-constrained summaries; all requested details appear in the full manuscript. Section 4 defines baselines (e.g., VILA, LLaVA, Qwen-VL, GPT-4V), reports exact training and inference settings, and presents accuracy numbers on 12 image and 6 video benchmarks. Section 5 contains systematic ablations of the scale-then-compress stages and efficiency optimizations. We follow standard VLM reporting practice and do not include formal statistical significance tests; results are averaged over multiple seeds where variance is material. The central accuracy-plus-efficiency claim is therefore directly supported by the experimental sections rather than the abstract alone. revision: no
-
Referee: [Scale-then-compress procedure] Scale-then-compress procedure (described in the abstract and presumably §3): the headline claim requires that scaling resolution then compressing tokens preserves all task-relevant visual information on the chosen benchmarks; this is least secure because standard VLM benchmarks often tolerate moderate information loss, and the manuscript would need to demonstrate that the chosen compression does not degrade performance on finer-grained or distribution-shifted cases.
Authors: Section 3.2 details the scale-then-compress pipeline and the compression ratios chosen after scaling. Section 5.2 ablates the compression stage on both image and video tasks and shows that accuracy remains within 0.5–1.5 points of the un-compressed high-resolution baseline on fine-grained benchmarks (e.g., TextVQA, DocVQA, ActivityNet-QA). We further evaluate on distribution-shifted video sets (Ego4D, YouCook2) where temporal compression is most aggressive; performance does not drop relative to VILA or other long-video baselines. These results indicate that task-relevant information is retained for the evaluated benchmarks. If the referee or editor requests, we can add an explicit paragraph in Section 5 discussing information preservation and any observed edge cases. revision: partial
Circularity Check
No significant circularity
full rationale
The paper presents an empirical architecture (scale-then-compress on top of prior VILA) and reports measured accuracy/efficiency gains on external benchmarks. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations that reduce claims to tautologies are present. Central results rest on benchmark comparisons rather than internal reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard vision-language benchmarks are sufficient proxies for model quality
Forward citations
Cited by 12 Pith papers
-
ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.
-
XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception
The XNote dataset and LVLM benchmarks demonstrate that current models face significant challenges in generating accurate, grounded Community Notes for image-based contextual deception.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
-
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
-
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
-
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation
Semantic progress reasoning predicts instruction-style advancement from visual history to guide policies, yielding state-of-the-art success and efficiency on R2R-CE and RxR-CE.
-
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
EgoVLA pretrains VLA models on egocentric human videos, retargets predicted actions to robots via IK, and fine-tunes on few robot demos to improve bimanual manipulation performance on a new simulation benchmark.
-
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
StableVLA adds an Information Bottleneck Adapter to VLA models that improves robustness to visual corruptions by 30% on average with under 10M extra parameters and no extra data, even when using a much smaller backbone.
-
A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.
-
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
Reference graph
Works this paper leans on
-
[1]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. InCon- ference on Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[2]
VILA: On Pre- training for Visual Language Models
Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On Pre- training for Visual Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), 2024
work page 2024
-
[3]
InternVL: Scal- ing up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, 12 NVILA: Efficient Frontier Visual Language Models Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scal- ing up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In IEEE/CVF Conference on Computer Vis...
work page 2024
-
[4]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy Visual Task Transfer. arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhi- hao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, XuanchengRen, RuiMen, DayihengLiu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution.arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, De...
work page 2022
-
[7]
NaVid: Video- based VLM Plans the Next Step for Vision-and- Language Navigation
Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and Wang He. NaVid: Video- based VLM Plans the Next Step for Vision-and- Language Navigation. In Robotics: Science and Systems (RSS), 2024
work page 2024
-
[8]
NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion
An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion. arXiv:2412.04453, 2024
-
[9]
DriveVLM: The Con- vergence of Autonomous Driving and Large Vision- Language Models
Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xi- anpeng Lang, and Hang Zhao. DriveVLM: The Con- vergence of Autonomous Driving and Large Vision- Language Models. InConference on Robot Learning (CoRL), 2024
work page 2024
-
[10]
Capabilities of Gemini Models in Medicine
Khaled Saab, Tao Tu, Wei-Hung Weng, Ryu- taro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of Gemini Models in Medicine. arXiv:2404.18416, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
VILA-M3: Enhancing Vision- Language Models with Medical Expert Knowledge
Vishwesh Nath, Wenqi Li, Dong Yang, Andriy Myro- nenko, Mingxin Zheng, Yao Lu, Zhijian Liu, Hongxu Yin, Yee Man Law, Yucheng Tang, Pengfei Guo, Can Zhao, Ziyue Xu, Yufan He, Greg Heinrich, Stephen Aylward, Marc Edgar, Michael Zephyr, Pavlo Molchanov, Baris Turkbey, Holger Roth, and Daguang Xu. VILA-M3: Enhancing Vision- Language Models with Medical Expert...
- [12]
- [13]
-
[14]
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. InIEEE/CVF International Confer- ence on Computer Vision (ICCV), 2023
work page 2023
-
[15]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Hao- ran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, M...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
When Do We Not Need Larger Vision Models? InEuropean Conference on Com- puter Vision (ECCV), 2024
Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When Do We Not Need Larger Vision Models? InEuropean Conference on Com- puter Vision (ECCV), 2024
work page 2024
-
[17]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv:2408.01800, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian 13 NVILA: Efficient Frontier Visual Language Models Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. LongVILA: Scaling Long-Context Visual Language Models for Long Videos. arXiv:2408.10188, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Tem- poral Segment Networks: Towards Good Practices for Deep Action Recognition
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Tem- poral Segment Networks: Towards Good Practices for Deep Action Recognition. InEuropean Confer- ence on Computer Vision (ECCV), 2016
work page 2016
-
[21]
Cambrian-1: A Fully Open, Vision- Centric Exploration of Multimodal LLMs
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A Fully Open, Vision- Centric Exploration of Multimodal LLMs. InCon- ference on Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[22]
Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What Matters When Building Vision-Language Models? In Conference on Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[23]
Selec- tion via Proxy: Efficient Data Selection for Deep Learning
Cody Coleman, Christopher Yeh, Stephen Muss- mann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selec- tion via Proxy: Efficient Data Selection for Deep Learning. In International Conference on Learning Representations (ICLR), 2020
work page 2020
-
[24]
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP Data. InInter- national Conference on Learning Representations (ICLR), 2024
work page 2024
-
[25]
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. SemDeDup: Data- Efficient Learning at Web-Scale through Semantic Deduplication. arXiv:2303.09540, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
D4: Improving LLM Pretraining via Document De-Duplication and Diversification
Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari Morcos. D4: Improving LLM Pretraining via Document De-Duplication and Diversification. In Conference on Neural Information Processing Systems (NeurIPS), 2023
work page 2023
-
[27]
LESS: Select- ing Influential Data for Targeted Instruction Tuning
Mengzhou Xia, Sadhika Malladi, Suchin Gururan- gan, Sanjeev Arora, and Danqi Chen. LESS: Select- ing Influential Data for Targeted Instruction Tuning. In International Conference on Machine Learning (ICML), 2024
work page 2024
-
[28]
Data Selection via Optimal Control for Language Models
Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, and Minlie Huang. Data Selection via Optimal Control for Language Models. arXiv:2410.07064, 2024
-
[29]
MiniPLM: Knowl- edge Distillation for Pre-Training Language Models
Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. MiniPLM: Knowl- edge Distillation for Pre-Training Language Models. arXiv:2410.17215, 2024
-
[30]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Matt Deitke, Christopher Clark, Sangho Lee, Ro- hun Tripathi, Yue Yang, Jae Sung Park, Moham- madreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bran- som, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Paulius Micikevicius, Sharan Narang, Jonah Al- ben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. InInternational Conference on Learning Representations (ICLR), 2018
work page 2018
-
[32]
A Study of BFLOAT16 for Deep Learning Training
Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja, Nataraj Jam- malamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A Study of BFLOAT16 for Deep Lea...
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[33]
FP8-LM: Training FP8 Large Language Models
Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, and Peng Cheng. FP8-LM: Training FP8 Large Language Models. arXiv:2310.18313, 2023
-
[34]
Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, and Song Han. COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training.arXiv:2410.19313, 2024
-
[35]
Liger Kernel: Efficient Triton Kernels for LLM Training.arXiv:2410.10989, 2024
Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yan- ning Chen. Liger Kernel: Efficient Triton Kernels for LLM Training.arXiv:2410.10989, 2024
-
[36]
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the Zoo: Chain-of-Action-Thought for GUI Agents. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024
work page 2024
-
[37]
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A 14 NVILA: Efficient Frontier Visual Language Models Benchmark for Interpreting Grounded Instructions for Everyday Tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
work page 2020
-
[38]
nuScenes: A multimodal dataset for au- tonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for au- tonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
work page 2020
-
[39]
PathVQA: 30000+ Questions for Medical Visual Question Answering
Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. PathVQA: 30000+ Ques- tions for Medical Visual Question Answering. arXiv:2003.10286, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[40]
Widget Captioning: Generat- ing Natural Language Description for Mobile User Interface Elements
Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget Captioning: Generat- ing Natural Language Description for Mobile User Interface Elements. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
work page 2020
-
[41]
AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration. In Conference on Machine Learning and Systems (ML- Sys), 2024
work page 2024
-
[42]
PyTorch: An Imperative Style, High-Performance Deep Learn- ing Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Te- jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Perfor...
work page 2019
-
[43]
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Lau- rent Kirsch, Michael...
work page 2024
-
[44]
Transform- ers: State-of-the-Art Natural Language Processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Fun- towicz, Joe Davison, Sam Shleifer, Patrick von, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transform- ers: State-of-the-Art Natural ...
work page 2020
-
[45]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System Optimiza- tions Enable Training Deep Learning Models with Over 100 Billion Parameters. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020
work page 2020
-
[46]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. InIn- ternational Conference on Learning Representations (ICLR), 2024
work page 2024
-
[47]
A Diagram is Worth a Dozen Images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A Diagram is Worth a Dozen Images. InEuropean Conference on Computer Vision (ECCV), 2016
work page 2016
-
[48]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2022
work page 2022
-
[49]
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A Dataset for VQA on Document Images. InIEEE/CVF Winter Confer- ence on Applications of Computer Vision (WACV), 2021
work page 2021
-
[50]
Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimos- thenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. InIEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022
work page 2022
-
[51]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. InInter- national Conference on Learning Representations (ICLR), 2024
work page 2024
-
[52]
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Ex- pert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wen- hao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for E...
work page 2024
-
[53]
xAI. Grok-1.5, 2024. 15 NVILA: Efficient Frontier Visual Language Models
work page 2024
-
[54]
SEED-Bench: Bench- marking Multimodal LLMs with Generative Com- prehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Bench- marking Multimodal LLMs with Generative Com- prehension. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[55]
Towards VQA Models That Can Read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
work page 2019
-
[56]
Making the V in VQA Matter: Elevating the Role of Image Un- derstanding in Visual Question Answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Un- derstanding in Visual Question Answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017
work page 2017
-
[57]
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering. InAAAI Conference on Artificial Intelligence (AAAI), 2019
work page 2019
-
[58]
MLVU: Benchmarking Multi-task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understand- ing. arXiv:2406.04264, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. InIEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024
work page 2024
-
[60]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-MME: The First-Ever Com- prehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis.arXiv:2405.2...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[61]
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, and Bo Zhao. Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding. arXiv:2409.14485, 2024
-
[62]
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bor- des, Zhuang Liu, Hu Xu, Hyunwoo J Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed El- hoseiny, and Vikas Chandra. LongVU: Spatiotempo- ral Adaptive Compression for Long Video-Language Understanding. arXiv:24...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[63]
Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution
Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution. arXiv:2409.12961, 2024
-
[64]
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gon- zalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. InACM Symposium on Op- erating Systems Principles (SOSP), 2023
work page 2023
-
[65]
Beyond the Nav- Graph: Vision-and-Language Navigation in Con- tinuous Environments
Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the Nav- Graph: Vision-and-Language Navigation in Con- tinuous Environments. InEuropean Conference on Computer Vision (ECCV), 2020
work page 2020
-
[66]
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision- and-Language Navigation: Interpreting visually- grounded navigation instructions in real environ- ments. InIEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2018
work page 2018
- [67]
-
[68]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Google. Gemini 1.5: Unlocking Multimodal Un- derstanding Across Millions of Tokens of Context. arXiv:2403.05530, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[69]
Gemini: A Family of Highly Capable Multimodal Models
Google. Gemini: A Family of Highly Capable Mul- timodal Models. arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [70]
-
[71]
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning.arXiv:2409.20566, 2024
Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, and Yinfei Yang. MM1.5: Methods, Analysis & Insights f...
-
[72]
arXiv preprint arXiv:1909.09577 , year=
Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Olek- sii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, and Jonathan M Cohen. NeMo: A Toolkit for Building AI Applications using Neural Modules. arXiv:1909.09577, 2019
-
[73]
VILA2: VILA Augmented VILA.arXiv:2407.17453, 2024
YunhaoFang, LigengZhu, YaoLu, YanWang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, and Hongxu Yin. VILA2: VILA Augmented VILA.arXiv:2407.17453, 2024
-
[74]
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Sub- hashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, and Guilin Liu. Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders. arXiv:2408.15998, 2024
-
[75]
NVLM: Open Frontier-Class Multimodal LLMs
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuol- ing Yang, Zihan Liu, Jon Barker, Tuomas Rinta- maki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NVLM: Open Frontier-Class Multimodal LLMs. arXiv:2409.11402, 2024
-
[76]
Meta. Llama 3, 2024. 16 NVILA: Efficient Frontier Visual Language Models
work page 2024
-
[77]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token Merging: Your ViT But Faster. In International Conference on Learning Representa- tions (ICLR), 2023
work page 2023
-
[78]
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An Image is Worth 1/2 Tokens After Layer 2: Plug- and-Play Inference Acceleration for Large Vision- Language Models. InEuropean Conference on Com- puter Vision (ECCV), 2024
work page 2024
-
[79]
PYRA: Parallel Yielding Re-Activation for Training-Inference Effi- cient Task Adaptation
Yizhe Xiong, Hui Chen, Tianxiang Hao, Zijia Lin, Jungong Han, Yuesong Zhang, Guoxin Wang, Yongjun Bao, and Guiguang Ding. PYRA: Parallel Yielding Re-Activation for Training-Inference Effi- cient Task Adaptation. InEuropean Conference on Computer Vision (ECCV), 2024
work page 2024
-
[80]
vid-TLDR: Train- ing Free Token Merging for Light-Weight Video Transformer
Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Min- hyuk Choi, and Hyunwoo J Kim. vid-TLDR: Train- ing Free Token Merging for Light-Weight Video Transformer. In IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2024
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.