SmolVLM: Redefining small and efficient multimodal models
Pith reviewed 2026-05-13 20:18 UTC · model grok-4.3
The pith
SmolVLM shows optimized small vision-language models can outperform much larger ones with far less memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SmolVLM models are created through systematic choices in architecture, tokenization, and data curation that cut computational costs. The 256M-parameter version uses less than 1GB GPU memory at inference time and outperforms the 80B-parameter Idefics model. The 2.2B-parameter version matches current high-performing vision-language models while using half their memory. The same models also handle video understanding tasks effectively.
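The scale and memory figures above can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate. It assumes 16-bit weights (2 bytes per parameter), which this page does not state, and it counts weights only, ignoring activations and the KV cache. The function names are hypothetical.

```python
# Rough check of the parameter-count ratio and weight-only memory.
# Assumption (not from the paper): weights are stored in 16 bits,
# i.e. 2 bytes per parameter. Activations and KV cache are ignored.

def scale_ratio(large_params: float, small_params: float) -> float:
    """How many times larger the big model is, by parameter count."""
    return large_params / small_params


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


ratio = scale_ratio(80e9, 256e6)      # Idefics-80B vs SmolVLM-256M
small_gb = weight_memory_gb(256e6)    # SmolVLM-256M weights
large_gb = weight_memory_gb(80e9)     # 80B-parameter model weights

print(f"{ratio:.1f}x larger")         # 312.5x larger
print(f"{small_gb:.3f} GB")           # 0.512 GB
print(f"{large_gb:.1f} GB")           # 160.0 GB
```

Under that assumption the 256M model's weights take roughly half a gigabyte, which is at least consistent with the under-1GB inference claim, and the parameter ratio of about 312 matches the "300-times larger" wording in the abstract.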
What carries the argument
Efficient tokenization strategies paired with targeted architectural changes and curated training data that lower memory demand while preserving task accuracy.
If this is right
- Multimodal image and video tasks become feasible on mobile and edge hardware without large servers.
- Energy use for running vision-language models drops enough to support always-on applications.
- Development focus can shift from ever-larger parameter counts to smarter design for smaller scales.
- Video comprehension features can be added to devices with tight memory budgets.
Where Pith is reading between the lines
- Efficiency-focused design may prove more important than raw scale for many real-world multimodal uses.
- The same tokenization and curation tactics could transfer to other compact models in different domains.
- Testing these models under even tighter constraints, such as CPU-only or quantized inference, would reveal where the approach stops working.
Load-bearing premise
The performance comparisons assume that small and large models were tested under identical evaluation rules and similar training data conditions.
What would settle it
Re-running the exact same benchmarks on SmolVLM-256M and Idefics-80B with matched hardware, prompts, and data splits would show whether the small model truly outperforms.
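A matched re-run of this kind depends on both models sharing one evaluation protocol. The sketch below shows one way such a check could be expressed. The `EvalProtocol` fields, the benchmark name, and all values are hypothetical placeholders, not settings taken from the paper or from any evaluation toolkit.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalProtocol:
    """Settings that should be identical across the models being compared."""
    benchmark: str
    split: str
    prompt_template: str
    temperature: float
    top_p: float
    max_new_tokens: int


def protocol_mismatches(a: EvalProtocol, b: EvalProtocol) -> list[str]:
    """Return the names of the fields where the two protocols differ."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


# Hypothetical placeholder settings for illustration only.
small_cfg = EvalProtocol("example-vqa", "test", "Question: {q} Answer:", 0.0, 1.0, 32)
large_cfg = EvalProtocol("example-vqa", "test", "Question: {q} Answer:", 0.0, 1.0, 64)

print(protocol_mismatches(small_cfg, large_cfg))  # ['max_new_tokens']
```

An empty list would indicate that the comparison is matched on every recorded setting; any non-empty result names a confound to resolve before scores are compared.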
Original abstract
Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SmolVLM, a family of compact vision-language models (256M to 2.2B parameters) engineered for low GPU memory usage via optimized architectures, tokenization, and data curation. Central claims are that SmolVLM-256M uses <1GB memory at inference and outperforms the 300x larger Idefics-80B despite an 18-month gap, while the 2.2B variant rivals SOTA VLMs at half the memory; models also show video comprehension.
Significance. If the performance claims hold under matched conditions, the work would be significant for demonstrating that targeted efficiency optimizations can enable competitive multimodal performance at small scales, directly supporting on-device and edge deployment of VLMs where large models are impractical.
major comments (2)
- [Abstract] The load-bearing claim that SmolVLM-256M outperforms Idefics-80B requires explicit verification that Idefics results were obtained under identical evaluation protocols, datasets, prompt templates, decoding parameters, and task formulations; without this, the 18-month development gap introduces uncontrolled confounds that prevent interpreting the gap as evidence of superior design.
- [Results] (inferred from abstract) Reported benchmark wins are presented without error bars, ablation tables, or training curves, which makes it impossible to assess statistical robustness or to isolate the contribution of the claimed architectural and tokenization choices.
minor comments (1)
- [Abstract] The phrase 'outperforms the 300-times larger Idefics-80B model' would benefit from a parenthetical note on the exact benchmarks and settings used for both models.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have incorporated revisions to strengthen the presentation of our results and evaluation protocols.
Point-by-point responses
- Referee: [Abstract] The load-bearing claim that SmolVLM-256M outperforms Idefics-80B requires explicit verification that Idefics results were obtained under identical evaluation protocols, datasets, prompt templates, decoding parameters, and task formulations; without this, the 18-month development gap introduces uncontrolled confounds that prevent interpreting the gap as evidence of superior design.
  Authors: We appreciate the referee highlighting the need for explicit protocol matching. In the revised manuscript we have added a dedicated evaluation protocol subsection that documents the exact datasets, prompt templates, decoding parameters (temperature, top-p, max new tokens), and task formulations used for all models, including direct alignment with the publicly reported Idefics-80B setup. While we cannot re-execute the 80B model due to resource constraints, the comparisons rely on standardized public benchmarks whose protocols are well-documented in the original papers; we have also added a brief discussion of the temporal gap and why the observed efficiency gains remain attributable to our design choices rather than uncontrolled variables.
  Revision: partial
- Referee: [Results] (inferred from abstract) Reported benchmark wins are presented without error bars, ablation tables, or training curves, which makes it impossible to assess statistical robustness or to isolate the contribution of the claimed architectural and tokenization choices.
  Authors: We agree that these elements improve interpretability. The revised results section now includes error bars computed over three independent evaluation runs for the primary benchmarks, a new ablation table isolating the impact of tokenization strategy and architectural modifications, and training curves (loss and validation accuracy) placed in the supplementary material. These additions allow readers to assess statistical robustness and the specific contributions of the optimizations we claim.
  Revision: yes
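Error bars over three evaluation runs, as described in the response above, could be computed along the following lines. This is a minimal sketch using the sample standard deviation; the scores are made-up placeholders rather than results from the paper, and `summarize_runs` is a hypothetical helper name.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Placeholder accuracies from three hypothetical runs (not paper results).
runs = [0.50, 0.52, 0.54]
m, s = summarize_runs(runs)
print(f"{m:.2f} +/- {s:.2f}")  # 0.52 +/- 0.02
```

With only three runs the spread estimate is itself noisy, so error bars of this kind are best read as a rough indication of run-to-run variation, not as a formal significance test.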
Circularity Check
No significant circularity in empirical performance claims
full rationale
The paper reports experimental results from training and evaluating a family of compact VLMs, with performance numbers presented as direct measurements on image and video tasks. No mathematical derivations, equations, or first-principles predictions appear that reduce by construction to fitted inputs, self-citations, or renamed ansatzes. The headline outperformance claim (SmolVLM-256M vs. Idefics-80B) is an empirical comparison rather than a derived quantity; any concerns about benchmark equivalence fall under validity rather than circularity. The design exploration is described as systematic search over configurations, not tautological self-definition. The derivation chain is therefore self-contained as a report of measured outcomes.
Axiom & Free-Parameter Ledger
free parameters (2)
- model parameter count
- tokenization compression ratio
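The ledger names a tokenization compression ratio without defining it. As an illustration only, the sketch below assumes a square image split into square patches, with each r-by-r block of patches merged into one visual token. That merging scheme, the function name, and the image and patch sizes are assumptions for the example, not details taken from this page.

```python
def visual_token_count(image_size: int, patch_size: int, ratio: int = 1) -> int:
    """Number of visual tokens for a square image.

    Assumes (hypothetically) that the image is cut into patch_size x patch_size
    patches and that every ratio x ratio block of patches becomes one token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size          # patches per side
    if grid % ratio != 0:
        raise ValueError("patch grid must be divisible by ratio")
    return (grid // ratio) ** 2


# Hypothetical 512x512 image with 16x16 patches.
print(visual_token_count(512, 16, ratio=1))  # 1024
print(visual_token_count(512, 16, ratio=2))  # 256
print(visual_token_count(512, 16, ratio=4))  # 64
```

Under this assumed scheme the token count falls with the square of the ratio, which is why a modest change in this one parameter can have a large effect on sequence length and memory.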
axioms (1)
- domain assumption: Transformer-based vision-language architecture remains effective at small scale
Forward citations
Cited by 27 Pith papers
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
-
Anny-Fit: All-Age Human Mesh Recovery
Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth...
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...
-
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.
-
SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...
-
Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection
The paper releases the Sens-VisualNews dataset of 9,576 annotated news images for sensational image detection and benchmarks open multimodal LLMs on zero-shot and fine-tuned performance.
-
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics
VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.
-
Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization
Parameter-efficient fine-tuning lets MLLMs serve as effective retrievers for natural-language-guided cross-view geo-localization, beating dual-encoder baselines on GeoText-1652 and CVG-Text while using far fewer train...
-
BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs
VLMs exhibit a consistent 'Texture Bias Cliff' and fail to comprehend pure geometric shapes from boundary contours alone in zero-shot settings.
-
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
-
An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis
A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
-
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
-
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
-
SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models
SVD-Prune selects vision tokens via SVD leverage scores to keep performance high even when pruning to only 16-32 tokens.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
-
WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models
WSVD delivers over 1.8x faster VLM decoding via weighted low-rank approximation at fine granularity plus quantization, without accuracy loss.
-
Lifelong Learning in Vision-Language Models: Enhanced EWC with Cross-Modal Knowledge Retention
Enhanced EWC for LVLMs cuts forgetting rates by 78% versus naive training and keeps visual-textual alignment with 15% extra compute.
-
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
-
Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification
Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.