PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Alexander Kolesnikov; Basil Mustafa; Daniel Keysers; Daniel Salz; Daniel Vlasic; Filip Pavetic; Ibrahim Alabdulmohsin; Jialin Wu; Keran Rong; Lucas Beyer

arxiv: 2310.09199 · v2 · pith:WXLBAWUNnew · submitted 2023-10-13 · 💻 cs.CV

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut

classification 💻 cs.CV

keywords: models, pali-3, vision, benchmarks, classification, faster, image, language

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
cs.CV 2024-09 accept novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
cs.CV 2026-05 unverdicted novelty 7.0

CrossVLA develops a surrogate log-probability estimator for DPO on flow-matching VLAs, shows DoRA outperforming LoRA by +10.4 pp mean on LIBERO, and identifies inference bottlenecks with limited caching gains.
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
cs.CV 2026-05 conditional novelty 7.0

CrossVLA introduces a surrogate log-probability estimator to enable DPO on flow-matching VLAs, reports DoRA yielding +10.4 pp mean gains over SFT on LIBERO with 600 trials, and shows inference caching limited to 21% s...
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders
cs.CV 2026-05 unverdicted novelty 6.0

C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K r...
PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
cs.CV 2026-05 unverdicted novelty 6.0

PARCEL is a new visual tokenization architecture combining pool-anchored resampling with conditioned elastic queries to enhance performance-efficiency tradeoffs in LVLMs over prior matryoshka methods.
CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion
cs.RO 2026-01 unverdicted novelty 6.0

CLARE is an exemplar-free continual learning framework for VLAs that autonomously expands modular adapters based on feature similarity and uses autoencoder routing for label-free deployment.
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
cs.LG 2025-06 unverdicted novelty 6.0

SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
Context and Pixel Aware Large Language Model for Video Quality Assessment
cs.CV 2025-05 unverdicted novelty 6.0

CP-LLM uses dual vision encoders in a multimodal LLM to separately handle video context and pixel distortions, then reasons about both to output quality scores and descriptions with claimed SOTA cross-dataset performance.
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
cs.CV 2025-03 unverdicted novelty 6.0

CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
cs.RO 2024-12 conditional novelty 6.0

Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
OpenVLA: An Open-Source Vision-Language-Action Model
cs.RO 2024-06 unverdicted novelty 6.0

OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning
cs.RO 2026-06 unverdicted novelty 5.0

MAGNIFIED applies RL fine-tuning to MLLMs for autonomous driving motion planning, yielding over 10.5% lower overlap rate and 38.9% lower off-road rate than SFT baseline on Waymo Open Motion Dataset.
Weak-to-Strong Knowledge Distillation Accelerates Visual Learning
cs.CV 2026-04 unverdicted novelty 5.0

Weak-to-strong knowledge distillation applied early and then turned off accelerates convergence to target performance in visual learning tasks by factors of 1.7-4.8x.
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
cs.LG 2025-11 unverdicted novelty 5.0

AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.
On The Application of Linear Attention in Multimodal Transformers
cs.CV 2026-04 unverdicted novelty 4.0

Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.
PaliGemma 2: A Family of Versatile VLMs for Transfer
cs.CV 2024-12 unverdicted novelty 4.0

PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...
PaliGemma: A versatile 3B VLM for transfer
cs.CV 2024-07 unverdicted novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.