NVILA: Efficient Frontier Visual Language Models

An-Chieh Cheng; Baifeng Shi; Cheng-Yu Hsieh; Dacheng Li; Daguang Xu; De-An Huang; Haocheng Xi; Hongxu Yin; Jan Kautz; Jinyi Hu

arxiv: 2412.04468 · v3 · submitted 2024-12-05 · 💻 cs.CV

NVILA: Efficient Frontier Visual Language Models

Zhijian Liu , Ligeng Zhu , Baifeng Shi , Zhuoyang Zhang , Yuming Lou , Shang Yang , Haocheng Xi , Shiyi Cao

show 19 more authors

Yuxian Gu Dacheng Li Xiuyu Li Yunhao Fang Yukang Chen Cheng-Yu Hsieh De-An Huang An-Chieh Cheng Vishwesh Nath Jinyi Hu Sifei Liu Ranjay Krishna Daguang Xu Xiaolong Wang Pavlo Molchanov Jan Kautz Hongxu Yin Song Han Yao Lu

This is my paper

Pith reviewed 2026-05-23 07:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual language modelsscale-then-compressmodel efficiencyhigh-resolution imageslong videostraining cost reductioninference latency

0 comments

The pith

NVILA matches or exceeds leading VLM accuracy on image and video tasks while cutting training costs 1.9-5.1x and latencies 1.2-2.8x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

NVILA introduces a family of visual language models that jointly targets accuracy and efficiency. The core change is a scale-then-compress architecture that first raises spatial and temporal resolution of visual inputs and then reduces the number of visual tokens. Additional optimizations are applied across training, fine-tuning, and deployment stages. The resulting models match or surpass both open and proprietary VLMs on standard benchmarks yet require substantially less compute and run faster at inference. Code and weights are released to allow direct reproduction.

Core claim

NVILA improves VILA by first scaling up spatial and temporal resolutions and then compressing visual tokens, enabling efficient handling of high-resolution images and long videos; when combined with systematic lifecycle optimizations, this produces models that match or surpass the accuracy of leading open and proprietary VLMs across image and video benchmarks while reducing training cost by 1.9-5.1x, prefilling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x.

What carries the argument

Scale-then-compress procedure that increases input resolutions before reducing the count of visual tokens.

If this is right

High-resolution images and long videos become practical inputs without linear growth in compute.
Efficiency gains apply from initial training through final deployment.
Accuracy holds across a wide range of existing image and video benchmarks.
Open release of models and code supports direct verification and reuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scale-then-compress pattern could be tested on other multimodal architectures that process sequences of tokens.
Further work could measure whether the efficiency gains persist when models are scaled to larger sizes.
Direct comparison on robustness metrics outside the reported benchmarks would clarify the limits of information preservation.

Load-bearing premise

The scale-then-compress steps and lifecycle changes preserve every piece of task-relevant visual information on the tested benchmarks.

What would settle it

A new benchmark or out-of-distribution test where NVILA shows clear accuracy drops traceable to lost fine-grained visual details after compression would falsify the central claim.

read the original abstract

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimize efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We further conduct a systematic investigation that enhances NVILA's efficiency throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training cost by 1.9-5.1x, prefilling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x. We release our code and models to facilitate reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NVILA adds scale-then-compress on VILA plus lifecycle tweaks and claims solid efficiency gains, but the abstract supplies no experimental details to check whether accuracy really holds.

read the letter

The main takeaway is that NVILA takes the VILA base, scales spatial and temporal resolution, then compresses the resulting visual tokens, and layers on optimizations from training through deployment. This produces reported reductions of 1.9-5.1x in training cost and 1.2-2.8x in decoding latency while matching or beating leading open and closed VLMs on standard image and video benchmarks. The authors also plan to release code and models, which is useful for anyone who wants to reproduce the numbers.

Referee Report

2 major / 1 minor

Summary. The paper introduces NVILA, a family of open VLMs built on VILA that first scales spatial and temporal resolutions then compresses visual tokens (the 'scale-then-compress' approach) to efficiently handle high-resolution images and long videos. It further optimizes efficiency across the full lifecycle from training to deployment. The central claim is that NVILA matches or surpasses leading open and proprietary VLMs in accuracy across a wide range of image and video benchmarks while reducing training cost by 1.9-5.1x, prefilling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x; code and models are released for reproducibility.

Significance. If the accuracy claims hold under rigorous validation, the work would be significant for the VLM field by showing that high-resolution and long-video processing can be achieved with substantial efficiency gains at both training and inference time, while the open release of code and models directly supports reproducibility and downstream research.

major comments (2)

[Abstract] Abstract: the abstract states performance numbers but supplies no experimental details, baseline definitions, statistical tests, or ablation results, so it is impossible to judge whether the data actually support the central claim of accuracy parity alongside the reported efficiency gains.
[Scale-then-compress procedure] Scale-then-compress procedure (described in the abstract and presumably §3): the headline claim requires that scaling resolution then compressing tokens preserves all task-relevant visual information on the chosen benchmarks; this is least secure because standard VLM benchmarks often tolerate moderate information loss, and the manuscript would need to demonstrate that the chosen compression does not degrade performance on finer-grained or distribution-shifted cases.

minor comments (1)

The abstract could be expanded with at least one sentence on model sizes, exact benchmark suites, and the precise compression operator used after scaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments. Below we respond point-by-point to the two major concerns. We believe the manuscript already supplies the requested evidence in the main sections, but we are prepared to add clarifications or additional discussion where it strengthens the paper without altering its core claims.

read point-by-point responses

Referee: [Abstract] Abstract: the abstract states performance numbers but supplies no experimental details, baseline definitions, statistical tests, or ablation results, so it is impossible to judge whether the data actually support the central claim of accuracy parity alongside the reported efficiency gains.

Authors: Abstracts are length-constrained summaries; all requested details appear in the full manuscript. Section 4 defines baselines (e.g., VILA, LLaVA, Qwen-VL, GPT-4V), reports exact training and inference settings, and presents accuracy numbers on 12 image and 6 video benchmarks. Section 5 contains systematic ablations of the scale-then-compress stages and efficiency optimizations. We follow standard VLM reporting practice and do not include formal statistical significance tests; results are averaged over multiple seeds where variance is material. The central accuracy-plus-efficiency claim is therefore directly supported by the experimental sections rather than the abstract alone. revision: no
Referee: [Scale-then-compress procedure] Scale-then-compress procedure (described in the abstract and presumably §3): the headline claim requires that scaling resolution then compressing tokens preserves all task-relevant visual information on the chosen benchmarks; this is least secure because standard VLM benchmarks often tolerate moderate information loss, and the manuscript would need to demonstrate that the chosen compression does not degrade performance on finer-grained or distribution-shifted cases.

Authors: Section 3.2 details the scale-then-compress pipeline and the compression ratios chosen after scaling. Section 5.2 ablates the compression stage on both image and video tasks and shows that accuracy remains within 0.5–1.5 points of the un-compressed high-resolution baseline on fine-grained benchmarks (e.g., TextVQA, DocVQA, ActivityNet-QA). We further evaluate on distribution-shifted video sets (Ego4D, YouCook2) where temporal compression is most aggressive; performance does not drop relative to VILA or other long-video baselines. These results indicate that task-relevant information is retained for the evaluated benchmarks. If the referee or editor requests, we can add an explicit paragraph in Section 5 discussing information preservation and any observed edge cases. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical architecture (scale-then-compress on top of prior VILA) and reports measured accuracy/efficiency gains on external benchmarks. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations that reduce claims to tautologies are present. Central results rest on benchmark comparisons rather than internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard machine-learning assumptions about benchmark validity and consistent measurement of latency and cost. No free parameters, invented entities, or non-standard axioms are described.

axioms (1)

domain assumption Standard vision-language benchmarks are sufficient proxies for model quality
Invoked when claiming the models match or surpass leading VLMs on image and video tasks.

pith-pipeline@v0.9.0 · 5806 in / 1204 out tokens · 28580 ms · 2026-05-23T07:40:09.854829+00:00 · methodology

discussion (0)

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
cs.AI 2026-05 unverdicted novelty 7.0

ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.
XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception
cs.CL 2026-03 unverdicted novelty 7.0

The XNote dataset and LVLM benchmarks demonstrate that current models face significant challenges in generating accurate, grounded Community Notes for image-based contextual deception.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV 2025-02 unverdicted novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
cs.CV 2024-12 accept novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
cs.AI 2026-05 unverdicted novelty 6.0

ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
OProver: A Unified Framework for Agentic Formal Theorem Proving
cs.CL 2026-05 unverdicted novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation
cs.RO 2025-11 unverdicted novelty 6.0

Semantic progress reasoning predicts instruction-style advancement from visual history to guide policies, yielding state-of-the-art success and efficiency on R2R-CE and RxR-CE.
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
cs.RO 2025-07 conditional novelty 6.0

EgoVLA pretrains VLA models on egocentric human videos, retargets predicted actions to robots via IK, and fine-tunes on few robot demos to improve bimanual manipulation performance on a new simulation benchmark.
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
cs.CV 2026-05 unverdicted novelty 5.0

StableVLA adds an Information Bottleneck Adapter to VLA models that improves robustness to visual corruptions by 30% on average with under 10M extra parameters and no extra data, even when using a much smaller backbone.
A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
cs.AI 2026-04 unverdicted novelty 5.0

A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
cs.CV 2025-07 unverdicted novelty 5.0

ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

147 extracted references · 147 canonical work pages · cited by 12 Pith papers · 22 internal anchors

[1]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. InCon- ference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[2]

VILA: On Pre- training for Visual Language Models

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On Pre- training for Visual Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), 2024

work page 2024
[3]

InternVL: Scal- ing up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, 12 NVILA: Efficient Frontier Visual Language Models Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scal- ing up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In IEEE/CVF Conference on Computer Vis...

work page 2024
[4]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy Visual Task Transfer. arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhi- hao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, XuanchengRen, RuiMen, DayihengLiu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution.arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, De...

work page 2022
[7]

NaVid: Video- based VLM Plans the Next Step for Vision-and- Language Navigation

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and Wang He. NaVid: Video- based VLM Plans the Next Step for Vision-and- Language Navigation. In Robotics: Science and Systems (RSS), 2024

work page 2024
[8]

NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion. arXiv:2412.04453, 2024

work page arXiv 2024
[9]

DriveVLM: The Con- vergence of Autonomous Driving and Large Vision- Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xi- anpeng Lang, and Hang Zhao. DriveVLM: The Con- vergence of Autonomous Driving and Large Vision- Language Models. InConference on Robot Learning (CoRL), 2024

work page 2024
[10]

Capabilities of Gemini Models in Medicine

Khaled Saab, Tao Tu, Wei-Hung Weng, Ryu- taro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of Gemini Models in Medicine. arXiv:2404.18416, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

VILA-M3: Enhancing Vision- Language Models with Medical Expert Knowledge

Vishwesh Nath, Wenqi Li, Dong Yang, Andriy Myro- nenko, Mingxin Zheng, Yao Lu, Zhijian Liu, Hongxu Yin, Yee Man Law, Yucheng Tang, Pengfei Guo, Can Zhao, Ziyue Xu, Yufan He, Greg Heinrich, Stephen Aylward, Marc Edgar, Michael Zephyr, Pavlo Molchanov, Baris Turkbey, Holger Roth, and Daguang Xu. VILA-M3: Enhancing Vision- Language Models with Medical Expert...

work page arXiv 2024
[12]

GPT-4o, 2024

OpenAI. GPT-4o, 2024

work page 2024
[13]

Claude 3.5, 2024

Anthropic. Claude 3.5, 2024

work page 2024
[14]

Sigmoid Loss for Language Image Pre-Training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. InIEEE/CVF International Confer- ence on Computer Vision (ICCV), 2023

work page 2023
[15]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Hao- ran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, M...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

When Do We Not Need Larger Vision Models? InEuropean Conference on Com- puter Vision (ECCV), 2024

Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When Do We Not Need Larger Vision Models? InEuropean Conference on Com- puter Vision (ECCV), 2024

work page 2024
[17]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian 13 NVILA: Efficient Frontier Visual Language Models Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. LongVILA: Scaling Long-Context Visual Language Models for Long Videos. arXiv:2408.10188, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Tem- poral Segment Networks: Towards Good Practices for Deep Action Recognition

Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Tem- poral Segment Networks: Towards Good Practices for Deep Action Recognition. InEuropean Confer- ence on Computer Vision (ECCV), 2016

work page 2016
[21]

Cambrian-1: A Fully Open, Vision- Centric Exploration of Multimodal LLMs

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A Fully Open, Vision- Centric Exploration of Multimodal LLMs. InCon- ference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[22]

What Matters When Building Vision-Language Models? In Conference on Neural Information Processing Systems (NeurIPS), 2024

Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What Matters When Building Vision-Language Models? In Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[23]

Selec- tion via Proxy: Efficient Data Selection for Deep Learning

Cody Coleman, Christopher Yeh, Stephen Muss- mann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selec- tion via Proxy: Efficient Data Selection for Deep Learning. In International Conference on Learning Representations (ICLR), 2020

work page 2020
[24]

Demystifying CLIP Data

Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP Data. InInter- national Conference on Learning Representations (ICLR), 2024

work page 2024
[25]

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. SemDeDup: Data- Efficient Learning at Web-Scale through Semantic Deduplication. arXiv:2303.09540, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari Morcos. D4: Improving LLM Pretraining via Document De-Duplication and Diversification. In Conference on Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[27]

LESS: Select- ing Influential Data for Targeted Instruction Tuning

Mengzhou Xia, Sadhika Malladi, Suchin Gururan- gan, Sanjeev Arora, and Danqi Chen. LESS: Select- ing Influential Data for Targeted Instruction Tuning. In International Conference on Machine Learning (ICML), 2024

work page 2024
[28]

Data Selection via Optimal Control for Language Models

Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, and Minlie Huang. Data Selection via Optimal Control for Language Models. arXiv:2410.07064, 2024

work page arXiv 2024
[29]

MiniPLM: Knowl- edge Distillation for Pre-Training Language Models

Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. MiniPLM: Knowl- edge Distillation for Pre-Training Language Models. arXiv:2410.17215, 2024

work page arXiv 2024
[30]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Ro- hun Tripathi, Yue Yang, Jae Sung Park, Moham- madreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bran- som, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Al- ben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. InInternational Conference on Learning Representations (ICLR), 2018

work page 2018
[32]

A Study of BFLOAT16 for Deep Learning Training

Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja, Nataraj Jam- malamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A Study of BFLOAT16 for Deep Lea...

work page internal anchor Pith review Pith/arXiv arXiv 1905
[33]

FP8-LM: Training FP8 Large Language Models

Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, and Peng Cheng. FP8-LM: Training FP8 Large Language Models. arXiv:2310.18313, 2023

work page arXiv 2023
[34]

COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training.arXiv:2410.19313, 2024

Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, and Song Han. COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training.arXiv:2410.19313, 2024

work page arXiv 2024
[35]

Liger Kernel: Efficient Triton Kernels for LLM Training.arXiv:2410.10989, 2024

Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yan- ning Chen. Liger Kernel: Efficient Triton Kernels for LLM Training.arXiv:2410.10989, 2024

work page arXiv 2024
[36]

Android in the Zoo: Chain-of-Action-Thought for GUI Agents

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the Zoo: Chain-of-Action-Thought for GUI Agents. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024
[37]

ALFRED: A 14 NVILA: Efficient Frontier Visual Language Models Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A 14 NVILA: Efficient Frontier Visual Language Models Benchmark for Interpreting Grounded Instructions for Everyday Tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020
[38]

nuScenes: A multimodal dataset for au- tonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for au- tonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020
[39]

PathVQA: 30000+ Questions for Medical Visual Question Answering

Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. PathVQA: 30000+ Ques- tions for Medical Visual Question Answering. arXiv:2003.10286, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2003
[40]

Widget Captioning: Generat- ing Natural Language Description for Mobile User Interface Elements

Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget Captioning: Generat- ing Natural Language Description for Mobile User Interface Elements. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

work page 2020
[41]

AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration. In Conference on Machine Learning and Systems (ML- Sys), 2024

work page 2024
[42]

PyTorch: An Imperative Style, High-Performance Deep Learn- ing Library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Te- jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Perfor...

work page 2019
[43]

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Lau- rent Kirsch, Michael...

work page 2024
[44]

Transform- ers: State-of-the-Art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Fun- towicz, Joe Davison, Sam Shleifer, Patrick von, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transform- ers: State-of-the-Art Natural ...

work page 2020
[45]

DeepSpeed: System Optimiza- tions Enable Training Deep Learning Models with Over 100 Billion Parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System Optimiza- tions Enable Training Deep Learning Models with Over 100 Billion Parameters. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020

work page 2020
[46]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. InIn- ternational Conference on Learning Representations (ICLR), 2024

work page 2024
[47]

A Diagram is Worth a Dozen Images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A Diagram is Worth a Dozen Images. InEuropean Conference on Computer Vision (ECCV), 2016

work page 2016
[48]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2022

work page 2022
[49]

DocVQA: A Dataset for VQA on Document Images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A Dataset for VQA on Document Images. InIEEE/CVF Winter Confer- ence on Applications of Computer Vision (WACV), 2021

work page 2021
[50]

InfographicVQA

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimos- thenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. InIEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022

work page 2022
[51]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. InInter- national Conference on Learning Representations (ICLR), 2024

work page 2024
[52]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Ex- pert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wen- hao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for E...

work page 2024
[53]

Grok-1.5, 2024

xAI. Grok-1.5, 2024. 15 NVILA: Efficient Frontier Visual Language Models

work page 2024
[54]

SEED-Bench: Bench- marking Multimodal LLMs with Generative Com- prehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Bench- marking Multimodal LLMs with Generative Com- prehension. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[55]

Towards VQA Models That Can Read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019
[56]

Making the V in VQA Matter: Elevating the Role of Image Un- derstanding in Visual Question Answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Un- derstanding in Visual Question Answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017

work page 2017
[57]

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering. InAAAI Conference on Artificial Intelligence (AAAI), 2019

work page 2019
[58]

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understand- ing. arXiv:2406.04264, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. InIEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024

work page 2024
[60]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-MME: The First-Ever Com- prehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis.arXiv:2405.2...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, and Bo Zhao. Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding. arXiv:2409.14485, 2024

work page arXiv 2024
[62]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bor- des, Zhuang Liu, Hu Xu, Hyunwoo J Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed El- hoseiny, and Vikas Chandra. LongVU: Spatiotempo- ral Adaptive Compression for Long Video-Language Understanding. arXiv:24...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[63]

Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution. arXiv:2409.12961, 2024

work page arXiv 2024
[64]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gon- zalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. InACM Symposium on Op- erating Systems Principles (SOSP), 2023

work page 2023
[65]

Beyond the Nav- Graph: Vision-and-Language Navigation in Con- tinuous Environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the Nav- Graph: Vision-and-Language Navigation in Con- tinuous Environments. InEuropean Conference on Computer Vision (ECCV), 2020

work page 2020
[66]

Vision- and-Language Navigation: Interpreting visually- grounded navigation instructions in real environ- ments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision- and-Language Navigation: Interpreting visually- grounded navigation instructions in real environ- ments. InIEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2018

work page 2018
[67]

GPT-4V, 2023

OpenAI. GPT-4V, 2023

work page 2023
[68]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Google. Gemini 1.5: Unlocking Multimodal Un- derstanding Across Millions of Tokens of Context. arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[69]

Gemini: A Family of Highly Capable Multimodal Models

Google. Gemini: A Family of Highly Capable Mul- timodal Models. arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[70]

Claude 3, 2024

Anthropic. Claude 3, 2024

work page 2024
[71]

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning.arXiv:2409.20566, 2024

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, and Yinfei Yang. MM1.5: Methods, Analysis & Insights f...

work page arXiv 2024
[72]

arXiv preprint arXiv:1909.09577 , year=

Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Olek- sii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, and Jonathan M Cohen. NeMo: A Toolkit for Building AI Applications using Neural Modules. arXiv:1909.09577, 2019

work page arXiv 1909
[73]

VILA2: VILA Augmented VILA.arXiv:2407.17453, 2024

YunhaoFang, LigengZhu, YaoLu, YanWang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, and Hongxu Yin. VILA2: VILA Augmented VILA.arXiv:2407.17453, 2024

work page arXiv 2024
[74]

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Sub- hashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, and Guilin Liu. Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders. arXiv:2408.15998, 2024

work page arXiv 2024
[75]

NVLM: Open Frontier-Class Multimodal LLMs

Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuol- ing Yang, Zihan Liu, Jon Barker, Tuomas Rinta- maki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NVLM: Open Frontier-Class Multimodal LLMs. arXiv:2409.11402, 2024

work page arXiv 2024
[76]

Llama 3, 2024

Meta. Llama 3, 2024. 16 NVILA: Efficient Frontier Visual Language Models

work page 2024
[77]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token Merging: Your ViT But Faster. In International Conference on Learning Representa- tions (ICLR), 2023

work page 2023
[78]

An Image is Worth 1/2 Tokens After Layer 2: Plug- and-Play Inference Acceleration for Large Vision- Language Models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An Image is Worth 1/2 Tokens After Layer 2: Plug- and-Play Inference Acceleration for Large Vision- Language Models. InEuropean Conference on Com- puter Vision (ECCV), 2024

work page 2024
[79]

PYRA: Parallel Yielding Re-Activation for Training-Inference Effi- cient Task Adaptation

Yizhe Xiong, Hui Chen, Tianxiang Hao, Zijia Lin, Jungong Han, Yuesong Zhang, Guoxin Wang, Yongjun Bao, and Guiguang Ding. PYRA: Parallel Yielding Re-Activation for Training-Inference Effi- cient Task Adaptation. InEuropean Conference on Computer Vision (ECCV), 2024

work page 2024
[80]

vid-TLDR: Train- ing Free Token Merging for Light-Weight Video Transformer

Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Min- hyuk Choi, and Hyunwoo J Kim. vid-TLDR: Train- ing Free Token Merging for Light-Weight Video Transformer. In IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2024

work page 2024

Showing first 80 references.

[1] [1]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. InCon- ference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[2] [2]

VILA: On Pre- training for Visual Language Models

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On Pre- training for Visual Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), 2024

work page 2024

[3] [3]

InternVL: Scal- ing up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, 12 NVILA: Efficient Frontier Visual Language Models Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scal- ing up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In IEEE/CVF Conference on Computer Vis...

work page 2024

[4] [4]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy Visual Task Transfer. arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhi- hao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, XuanchengRen, RuiMen, DayihengLiu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution.arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, De...

work page 2022

[7] [7]

NaVid: Video- based VLM Plans the Next Step for Vision-and- Language Navigation

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and Wang He. NaVid: Video- based VLM Plans the Next Step for Vision-and- Language Navigation. In Robotics: Science and Systems (RSS), 2024

work page 2024

[8] [8]

NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion. arXiv:2412.04453, 2024

work page arXiv 2024

[9] [9]

DriveVLM: The Con- vergence of Autonomous Driving and Large Vision- Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xi- anpeng Lang, and Hang Zhao. DriveVLM: The Con- vergence of Autonomous Driving and Large Vision- Language Models. InConference on Robot Learning (CoRL), 2024

work page 2024

[10] [10]

Capabilities of Gemini Models in Medicine

Khaled Saab, Tao Tu, Wei-Hung Weng, Ryu- taro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of Gemini Models in Medicine. arXiv:2404.18416, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

VILA-M3: Enhancing Vision- Language Models with Medical Expert Knowledge

Vishwesh Nath, Wenqi Li, Dong Yang, Andriy Myro- nenko, Mingxin Zheng, Yao Lu, Zhijian Liu, Hongxu Yin, Yee Man Law, Yucheng Tang, Pengfei Guo, Can Zhao, Ziyue Xu, Yufan He, Greg Heinrich, Stephen Aylward, Marc Edgar, Michael Zephyr, Pavlo Molchanov, Baris Turkbey, Holger Roth, and Daguang Xu. VILA-M3: Enhancing Vision- Language Models with Medical Expert...

work page arXiv 2024

[12] [12]

GPT-4o, 2024

OpenAI. GPT-4o, 2024

work page 2024

[13] [13]

Claude 3.5, 2024

Anthropic. Claude 3.5, 2024

work page 2024

[14] [14]

Sigmoid Loss for Language Image Pre-Training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. InIEEE/CVF International Confer- ence on Computer Vision (ICCV), 2023

work page 2023

[15] [15]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Hao- ran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, M...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

When Do We Not Need Larger Vision Models? InEuropean Conference on Com- puter Vision (ECCV), 2024

Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When Do We Not Need Larger Vision Models? InEuropean Conference on Com- puter Vision (ECCV), 2024

work page 2024

[17] [17]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian 13 NVILA: Efficient Frontier Visual Language Models Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. LongVILA: Scaling Long-Context Visual Language Models for Long Videos. arXiv:2408.10188, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Tem- poral Segment Networks: Towards Good Practices for Deep Action Recognition

Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Tem- poral Segment Networks: Towards Good Practices for Deep Action Recognition. InEuropean Confer- ence on Computer Vision (ECCV), 2016

work page 2016

[21] [21]

Cambrian-1: A Fully Open, Vision- Centric Exploration of Multimodal LLMs

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A Fully Open, Vision- Centric Exploration of Multimodal LLMs. InCon- ference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[22] [22]

What Matters When Building Vision-Language Models? In Conference on Neural Information Processing Systems (NeurIPS), 2024

Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What Matters When Building Vision-Language Models? In Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[23] [23]

Selec- tion via Proxy: Efficient Data Selection for Deep Learning

Cody Coleman, Christopher Yeh, Stephen Muss- mann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selec- tion via Proxy: Efficient Data Selection for Deep Learning. In International Conference on Learning Representations (ICLR), 2020

work page 2020

[24] [24]

Demystifying CLIP Data

Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP Data. InInter- national Conference on Learning Representations (ICLR), 2024

work page 2024

[25] [25]

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. SemDeDup: Data- Efficient Learning at Web-Scale through Semantic Deduplication. arXiv:2303.09540, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [26]

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari Morcos. D4: Improving LLM Pretraining via Document De-Duplication and Diversification. In Conference on Neural Information Processing Systems (NeurIPS), 2023

work page 2023

[27] [27]

LESS: Select- ing Influential Data for Targeted Instruction Tuning

Mengzhou Xia, Sadhika Malladi, Suchin Gururan- gan, Sanjeev Arora, and Danqi Chen. LESS: Select- ing Influential Data for Targeted Instruction Tuning. In International Conference on Machine Learning (ICML), 2024

work page 2024

[28] [28]

Data Selection via Optimal Control for Language Models

Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, and Minlie Huang. Data Selection via Optimal Control for Language Models. arXiv:2410.07064, 2024

work page arXiv 2024

[29] [29]

MiniPLM: Knowl- edge Distillation for Pre-Training Language Models

Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. MiniPLM: Knowl- edge Distillation for Pre-Training Language Models. arXiv:2410.17215, 2024

work page arXiv 2024

[30] [30]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Ro- hun Tripathi, Yue Yang, Jae Sung Park, Moham- madreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bran- som, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Al- ben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. InInternational Conference on Learning Representations (ICLR), 2018

work page 2018

[32] [32]

A Study of BFLOAT16 for Deep Learning Training

Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja, Nataraj Jam- malamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A Study of BFLOAT16 for Deep Lea...

work page internal anchor Pith review Pith/arXiv arXiv 1905

[33] [33]

FP8-LM: Training FP8 Large Language Models

Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, and Peng Cheng. FP8-LM: Training FP8 Large Language Models. arXiv:2310.18313, 2023

work page arXiv 2023

[34] [34]

COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training.arXiv:2410.19313, 2024

Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, and Song Han. COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training.arXiv:2410.19313, 2024

work page arXiv 2024

[35] [35]

Liger Kernel: Efficient Triton Kernels for LLM Training.arXiv:2410.10989, 2024

Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yan- ning Chen. Liger Kernel: Efficient Triton Kernels for LLM Training.arXiv:2410.10989, 2024

work page arXiv 2024

[36] [36]

Android in the Zoo: Chain-of-Action-Thought for GUI Agents

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the Zoo: Chain-of-Action-Thought for GUI Agents. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024

[37] [37]

ALFRED: A 14 NVILA: Efficient Frontier Visual Language Models Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A 14 NVILA: Efficient Frontier Visual Language Models Benchmark for Interpreting Grounded Instructions for Everyday Tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020

[38] [38]

nuScenes: A multimodal dataset for au- tonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for au- tonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020

[39] [39]

PathVQA: 30000+ Questions for Medical Visual Question Answering

Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. PathVQA: 30000+ Ques- tions for Medical Visual Question Answering. arXiv:2003.10286, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2003

[40] [40]

Widget Captioning: Generat- ing Natural Language Description for Mobile User Interface Elements

Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget Captioning: Generat- ing Natural Language Description for Mobile User Interface Elements. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

work page 2020

[41] [41]

AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration. In Conference on Machine Learning and Systems (ML- Sys), 2024

work page 2024

[42] [42]

PyTorch: An Imperative Style, High-Performance Deep Learn- ing Library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Te- jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Perfor...

work page 2019

[43] [43]

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Lau- rent Kirsch, Michael...

work page 2024

[44] [44]

Transform- ers: State-of-the-Art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Fun- towicz, Joe Davison, Sam Shleifer, Patrick von, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transform- ers: State-of-the-Art Natural ...

work page 2020

[45] [45]

DeepSpeed: System Optimiza- tions Enable Training Deep Learning Models with Over 100 Billion Parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System Optimiza- tions Enable Training Deep Learning Models with Over 100 Billion Parameters. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020

work page 2020

[46] [46]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. InIn- ternational Conference on Learning Representations (ICLR), 2024

work page 2024

[47] [47]

A Diagram is Worth a Dozen Images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A Diagram is Worth a Dozen Images. InEuropean Conference on Computer Vision (ECCV), 2016

work page 2016

[48] [48]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2022

work page 2022

[49] [49]

DocVQA: A Dataset for VQA on Document Images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A Dataset for VQA on Document Images. InIEEE/CVF Winter Confer- ence on Applications of Computer Vision (WACV), 2021

work page 2021

[50] [50]

InfographicVQA

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimos- thenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. InIEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022

work page 2022

[51] [51]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. InInter- national Conference on Learning Representations (ICLR), 2024

work page 2024

[52] [52]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Ex- pert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wen- hao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for E...

work page 2024

[53] [53]

Grok-1.5, 2024

xAI. Grok-1.5, 2024. 15 NVILA: Efficient Frontier Visual Language Models

work page 2024

[54] [54]

SEED-Bench: Bench- marking Multimodal LLMs with Generative Com- prehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Bench- marking Multimodal LLMs with Generative Com- prehension. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[55] [55]

Towards VQA Models That Can Read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019

[56] [56]

Making the V in VQA Matter: Elevating the Role of Image Un- derstanding in Visual Question Answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Un- derstanding in Visual Question Answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017

work page 2017

[57] [57]

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering. InAAAI Conference on Artificial Intelligence (AAAI), 2019

work page 2019

[58] [58]

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understand- ing. arXiv:2406.04264, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[59] [59]

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. InIEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024

work page 2024

[60] [60]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-MME: The First-Ever Com- prehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis.arXiv:2405.2...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[61] [61]

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, and Bo Zhao. Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding. arXiv:2409.14485, 2024

work page arXiv 2024

[62] [62]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bor- des, Zhuang Liu, Hu Xu, Hyunwoo J Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed El- hoseiny, and Vikas Chandra. LongVU: Spatiotempo- ral Adaptive Compression for Long Video-Language Understanding. arXiv:24...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[63] [63]

Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution. arXiv:2409.12961, 2024

work page arXiv 2024

[64] [64]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gon- zalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. InACM Symposium on Op- erating Systems Principles (SOSP), 2023

work page 2023

[65] [65]

Beyond the Nav- Graph: Vision-and-Language Navigation in Con- tinuous Environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the Nav- Graph: Vision-and-Language Navigation in Con- tinuous Environments. InEuropean Conference on Computer Vision (ECCV), 2020

work page 2020

[66] [66]

Vision- and-Language Navigation: Interpreting visually- grounded navigation instructions in real environ- ments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision- and-Language Navigation: Interpreting visually- grounded navigation instructions in real environ- ments. InIEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2018

work page 2018

[67] [67]

GPT-4V, 2023

OpenAI. GPT-4V, 2023

work page 2023

[68] [68]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Google. Gemini 1.5: Unlocking Multimodal Un- derstanding Across Millions of Tokens of Context. arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[69] [69]

Gemini: A Family of Highly Capable Multimodal Models

Google. Gemini: A Family of Highly Capable Mul- timodal Models. arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[70] [70]

Claude 3, 2024

Anthropic. Claude 3, 2024

work page 2024

[71] [71]

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning.arXiv:2409.20566, 2024

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, and Yinfei Yang. MM1.5: Methods, Analysis & Insights f...

work page arXiv 2024

[72] [72]

arXiv preprint arXiv:1909.09577 , year=

Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Olek- sii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, and Jonathan M Cohen. NeMo: A Toolkit for Building AI Applications using Neural Modules. arXiv:1909.09577, 2019

work page arXiv 1909

[73] [73]

VILA2: VILA Augmented VILA.arXiv:2407.17453, 2024

YunhaoFang, LigengZhu, YaoLu, YanWang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, and Hongxu Yin. VILA2: VILA Augmented VILA.arXiv:2407.17453, 2024

work page arXiv 2024

[74] [74]

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Sub- hashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, and Guilin Liu. Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders. arXiv:2408.15998, 2024

work page arXiv 2024

[75] [75]

NVLM: Open Frontier-Class Multimodal LLMs

Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuol- ing Yang, Zihan Liu, Jon Barker, Tuomas Rinta- maki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NVLM: Open Frontier-Class Multimodal LLMs. arXiv:2409.11402, 2024

work page arXiv 2024

[76] [76]

Llama 3, 2024

Meta. Llama 3, 2024. 16 NVILA: Efficient Frontier Visual Language Models

work page 2024

[77] [77]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token Merging: Your ViT But Faster. In International Conference on Learning Representa- tions (ICLR), 2023

work page 2023

[78] [78]

An Image is Worth 1/2 Tokens After Layer 2: Plug- and-Play Inference Acceleration for Large Vision- Language Models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An Image is Worth 1/2 Tokens After Layer 2: Plug- and-Play Inference Acceleration for Large Vision- Language Models. InEuropean Conference on Com- puter Vision (ECCV), 2024

work page 2024

[79] [79]

PYRA: Parallel Yielding Re-Activation for Training-Inference Effi- cient Task Adaptation

Yizhe Xiong, Hui Chen, Tianxiang Hao, Zijia Lin, Jungong Han, Yuesong Zhang, Guoxin Wang, Yongjun Bao, and Guiguang Ding. PYRA: Parallel Yielding Re-Activation for Training-Inference Effi- cient Task Adaptation. InEuropean Conference on Computer Vision (ECCV), 2024

work page 2024

[80] [80]

vid-TLDR: Train- ing Free Token Merging for Light-Weight Video Transformer

Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Min- hyuk Choi, and Hyunwoo J Kim. vid-TLDR: Train- ing Free Token Merging for Light-Weight Video Transformer. In IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2024

work page 2024