arxiv: 2407.07726 · v2 · submitted 2024-07-10 · cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 3 theorem links

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, Xiaohua Zhai

Pith reviewed 2026-05-11 13:03 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.CL · cs.LG

keywords: vision-language model, transfer learning, multimodal AI, open weights, computer vision, image captioning, visual question answering

The pith

PaliGemma fuses a SigLIP vision encoder with the Gemma-2B language model to produce a 3B open VLM that transfers effectively to nearly 40 tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PaliGemma as an open vision-language model built from existing components and trained to serve as a versatile base. It claims this model delivers strong results across standard benchmarks plus specialized domains such as remote sensing and segmentation. A reader would care because a single pretrained model that adapts readily could reduce the cost and data needed for new multimodal applications. The work emphasizes openness so others can build upon or fine-tune the same weights rather than starting from scratch.

Core claim

PaliGemma is a 3-billion-parameter vision-language model that combines the SigLIP-So400m vision encoder with the Gemma-2B language model. When trained with a recipe intended to create a broadly knowledgeable base, it achieves strong performance on almost 40 diverse open-world tasks without requiring task-specific architectures.

What carries the argument

The integration of the SigLIP vision encoder and Gemma language model under a training procedure that aims to produce transferable multimodal capabilities.

Load-bearing premise

The training procedure applied to these two components produces a model whose effectiveness on the reported tasks holds without undisclosed choices in task selection or evaluation details.

What would settle it

An independent training run of the identical architecture and data recipe that measures accuracy on the same set of tasks and finds performance that deviates substantially from the published numbers.
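
The test described above can be sketched as a simple per-task comparison. This is a minimal illustration, not a protocol from the paper: the function name `substantial_deviations`, the task names, the scores, and the tolerance are all hypothetical placeholders, and a real check would also need to account for run-to-run variance.

```python
def substantial_deviations(published, reproduced, tolerance):
    """Return {task: gap} for tasks whose reproduced score differs from
    the published score by more than `tolerance` (in metric points)."""
    flagged = {}
    for task, published_score in published.items():
        if task not in reproduced:
            continue  # task was not re-run; nothing to compare
        gap = reproduced[task] - published_score
        if abs(gap) > tolerance:
            flagged[task] = gap
    return flagged


# Placeholder numbers for illustration only; these are NOT results from the paper.
published = {"task_a": 80.0, "task_b": 65.0, "task_c": 72.5}
reproduced = {"task_a": 79.5, "task_b": 58.0, "task_c": 73.0}

flagged = substantial_deviations(published, reproduced, tolerance=2.0)
print(flagged)
```

What counts as "substantial" is the open choice here: the tolerance would have to be set per metric and justified separately, since the page itself does not define it.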

Original abstract

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PaliGemma is a straightforward open 3B VLM release that combines SigLIP and Gemma-2B into a transferable base, with the value hinging on whether the full paper backs the broad claims with concrete numbers and protocols.

The paper introduces PaliGemma, a 3B-parameter vision-language model built from the SigLIP-So400m encoder and the Gemma-2B language model. It is trained as a versatile starting point and then tested on nearly 40 tasks that mix standard VLM benchmarks with more specialized ones such as remote sensing and segmentation. The main new element is the trained model itself plus the demonstration that this particular combination transfers reasonably well without heavy per-task engineering.

Referee Report

3 major / 2 minor

Summary. The paper introduces PaliGemma, an open 3B-parameter Vision-Language Model combining the SigLIP-So400m vision encoder with the Gemma-2B language model. It is positioned as a versatile base model trained for broad knowledge and effective transfer, with claims of strong performance across nearly 40 diverse tasks that include standard VLM benchmarks as well as specialized domains such as remote-sensing and segmentation.

Significance. If the transfer-effectiveness claims hold under rigorous evaluation, the release of this open VLM could provide a practical starting point for multimodal research, reducing the need for task-specific pretraining from scratch and extending utility to non-standard domains like remote sensing.

major comments (3)

[Abstract] Abstract: the central claim of 'strong performance' and 'versatile' transfer on ~40 tasks is asserted without any quantitative results, baselines, error bars, or even a high-level summary of metrics; this absence is load-bearing because the generalization assertion cannot be assessed from the given text.
[Evaluation] Evaluation section (implied by the task count): no description is provided of the data mixture, training objective, per-task prompt templates, or metric implementations, leaving the weakest assumption (that the SigLIP+Gemma recipe yields generalizable transfer without post-hoc selection) untestable.
[Methods] Methods: the precise recipe for combining SigLIP-So400m and Gemma-2B (including any continued pretraining or alignment stages) is unspecified, which directly affects reproducibility and the claim that this particular combination is broadly effective.

minor comments (2)

[Abstract] The model is described as '3B' while the components are listed as So400m + 2B; a brief clarification of total parameter count or whether the figure is approximate would avoid confusion.
A summary table listing the ~40 tasks with primary metrics and comparisons to prior open VLMs would substantially improve readability and allow readers to verify the breadth of the evaluation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below, providing clarifications from the manuscript and indicating revisions that will strengthen the presentation of our claims, methods, and evaluations.

Referee: [Abstract] Abstract: the central claim of 'strong performance' and 'versatile' transfer on ~40 tasks is asserted without any quantitative results, baselines, error bars, or even a high-level summary of metrics; this absence is load-bearing because the generalization assertion cannot be assessed from the given text.

Authors: We agree that the abstract would benefit from a concise quantitative anchor to make the 'strong performance' and 'versatile' claims more immediately assessable. The full manuscript contains extensive tables and figures with per-task metrics, baselines, and comparisons, but these are not summarized at the abstract level. In the revision we will add one or two sentences providing a high-level overview (e.g., number of tasks where PaliGemma matches or exceeds prior open models, reference to the main result tables) while preserving the abstract's brevity.

Revision: yes

Referee: [Evaluation] Evaluation section (implied by the task count): no description is provided of the data mixture, training objective, per-task prompt templates, or metric implementations, leaving the weakest assumption (that the SigLIP+Gemma recipe yields generalizable transfer without post-hoc selection) untestable.

Authors: The manuscript does contain a training and evaluation section that specifies the overall data mixture, the multimodal training objective, and the evaluation protocol. However, we acknowledge that the level of detail on per-task prompt templates and exact metric implementations could be more explicit to allow readers to verify the absence of post-hoc selection. We will expand this section with additional specifics on the data composition, objective formulation, representative prompt templates, and metric definitions for the ~40 tasks.

Revision: yes

Referee: [Methods] Methods: the precise recipe for combining SigLIP-So400m and Gemma-2B (including any continued pretraining or alignment stages) is unspecified, which directly affects reproducibility and the claim that this particular combination is broadly effective.

Authors: The methods section describes the architectural integration of the SigLIP-So400m encoder with the Gemma-2B decoder, the joint training procedure, and the stages used to produce the final model. To improve reproducibility we will add more granular details on the exact combination mechanism, any continued pretraining or alignment phases, hyper-parameters, and initialization choices. This will make the recipe fully specified while preserving the core claim that the SigLIP+Gemma pairing enables broad transfer.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model release with no derivation chain

The paper presents PaliGemma as an open VLM combining the SigLIP-So400m vision encoder and Gemma-2B language model, trained for versatile transfer and evaluated on nearly 40 tasks. No equations, first-principles derivations, predictions, or fitted parameters are introduced that could reduce to inputs by construction. Claims rest entirely on external benchmark results rather than internal self-definitions or self-citation chains. This is a standard empirical model release whose central assertions are externally falsifiable and independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the model is assembled from existing published components (SigLIP, Gemma) whose internal details are treated as given.

pith-pipeline@v0.9.0 · 5540 in / 1070 out tokens · 53866 ms · 2026-05-11T13:03:47.236577+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

HierarchyEmergence hierarchy_emergence_forces_phi unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

PaliGemma consists of three components: An image encoder... A decoder-only language model... A linear layer projecting SigLIP’s output tokens into the same dimensions as Gemma-2B’s vocab tokens
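
The role of the third component in the passage above, the linear projection, can be sketched in a few lines. This is a toy illustration in plain Python, not the released implementation: the names `linear_project` and `build_input_sequence`, the tiny dimensions, and the random weights are all hypothetical. Placing the projected image tokens before the text embeddings is also an assumption of this sketch; the quoted passage only states that the two share the same dimensions.

```python
import random


def linear_project(tokens, weight, bias):
    """Apply y = x @ W + b to every token, where each token is a list of floats."""
    out_dim = len(bias)
    return [
        [sum(x * weight[i][j] for i, x in enumerate(token)) + bias[j]
         for j in range(out_dim)]
        for token in tokens
    ]


def build_input_sequence(image_tokens, text_embeddings, weight, bias):
    """Project image tokens to the language-model width and join them with
    the text embeddings (image-first ordering is assumed here)."""
    return linear_project(image_tokens, weight, bias) + text_embeddings


# Toy sizes chosen for readability; the real models are far wider.
VISION_DIM, LM_DIM = 4, 6
NUM_IMAGE_TOKENS, NUM_TEXT_TOKENS = 3, 2

rng = random.Random(0)
image_tokens = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)]
                for _ in range(NUM_IMAGE_TOKENS)]   # stand-in for encoder output
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LM_DIM)]
                   for _ in range(NUM_TEXT_TOKENS)]  # stand-in for token embeddings
weight = [[rng.uniform(-1, 1) for _ in range(LM_DIM)] for _ in range(VISION_DIM)]
bias = [0.0] * LM_DIM

sequence = build_input_sequence(image_tokens, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))
```

After projection every element of `sequence` has the language-model width, so a decoder-only model can consume image and text positions through the same input interface.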

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 56 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
cs.RO 2026-04 unverdicted novelty 8.0

RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
cs.CV 2024-09 accept novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
Dynamic Execution Commitment of Vision-Language-Action Models
cs.CV 2026-05 unverdicted novelty 7.0

A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning
cs.RO 2026-05 unverdicted novelty 7.0

RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 conditional novelty 7.0

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 7.0

MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 7.0

CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
cs.CV 2026-04 unverdicted novelty 7.0

SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
cs.CV 2026-04 conditional novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
cs.CV 2026-04 unverdicted novelty 7.0

DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
cs.LG 2026-04 unverdicted novelty 7.0

MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
cs.RO 2026-03 conditional novelty 7.0

GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
cs.RO 2026-03 unverdicted novelty 7.0

AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO 2026-05 unverdicted novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 conditional novelty 6.0

Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 unverdicted novelty 6.0

Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
cs.CV 2026-05 unverdicted novelty 6.0

Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, yielding 8.7x lower latency with 54-72% performance retention on ScanNet and ...
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
cs.AI 2026-05 unverdicted novelty 6.0

Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

Interventional attribution via ISS and NMR diagnoses causal misalignment in VLA policies and predicts their generalization performance across manipulation tasks.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
MotuBrain: An Advanced World Action Model for Robot Control
cs.RO 2026-04 unverdicted novelty 6.0

MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
ST-π: Structured SpatioTemporal VLA for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...
ViLL-E: Video LLM Embeddings for Retrieval
cs.CV 2026-04 unverdicted novelty 6.0

ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
cs.CV 2026-04 unverdicted novelty 6.0

TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.
AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
cs.RO 2026-04 unverdicted novelty 6.0

AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
cs.RO 2026-04 unverdicted novelty 6.0

RoboLab is a photorealistic simulation benchmark with 120 tasks and perturbation analysis to evaluate true generalization and robustness of robotic foundation models.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
cs.RO 2026-04 unverdicted novelty 6.0

VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
cs.RO 2026-04 conditional novelty 6.0

MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
cs.CV 2026-04 conditional novelty 6.0

UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV datase...
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
cs.RO 2025-12 unverdicted novelty 6.0

mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
Perception Encoder: The best visual embeddings are not at the output of the network
cs.CV 2025-04 unverdicted novelty 6.0

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
FAST: Efficient Action Tokenization for Vision-Language-Action Models
cs.RO 2025-01 unverdicted novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
π₀: A Vision-Language-Action Flow Model for General Robot Control
cs.LG 2024-10 unverdicted novelty 6.0

π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
Exploring and Exploiting Stability in Latent Flow Matching
cs.LG 2026-05 unverdicted novelty 5.0

Latent Flow Matching models exhibit inherent stability to data reduction and model shrinkage due to the flow matching objective, enabling reduced-dataset training and two-stage inference with over 2x speedup while pre...
Let ViT Speak: Generative Language-Image Pre-training
cs.CV 2026-05 unverdicted novelty 5.0

GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 5.0

VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
The Amazing Stability of Flow Matching
cs.CV 2026-04 unverdicted novelty 5.0

Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.
AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning
cs.LG 2026-04 unverdicted novelty 5.0

AEGIS uses a pre-computed Gaussian anchor and layer-wise Gram-Schmidt orthogonal projections to isolate destructive gradients during VLA fine-tuning, preserving VQA performance without co-training or replay.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
cs.CV 2026-04 unverdicted novelty 5.0

Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.
LLaVA-OneVision: Easy Visual Task Transfer
cs.CV 2024-08 unverdicted novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
ZAYA1-VL-8B Technical Report
cs.CV 2026-05 unverdicted novelty 4.0

ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
PLaMo 2.1-VL Technical Report
cs.CV 2026-04 unverdicted novelty 4.0

PLaMo 2.1-VL reports 61.5 ROUGE-L on JA-VG-VQA-500, 85.2% on Japanese Ref-L4, 53.9% zero-shot factory accuracy, and raises anomaly detection F1 from 39.7 to 64.9 after fine-tuning.
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
cs.CV 2025-02 unverdicted novelty 4.0

SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...
PaliGemma 2: A Family of Versatile VLMs for Transfer
cs.CV 2024-12 unverdicted novelty 4.0

PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...

Reference graph

Works this paper leans on

167 extracted references · 167 canonical work pages · cited by 50 Pith papers · 10 internal anchors

[1]

Acharya, K

M. Acharya, K. Kafle, and C. Kanan. Tal- lyqa: Answering complex counting ques- tions. InAAAI, 2019

work page 2019
[2]

arXiv preprint arXiv:2201.07520 , year=

A. Aghajanyan, B. Huang, C. Ross, V.Karpukhin,H.Xu,N.Goyal,D.Okhonko, M.Joshi,G.Ghosh,M.Lewis,andL.Zettle- moyer. CM3: A causal masked mul- timodal model of the internet. CoRR, abs/2201.07520, 2022. URL https:// arxiv.org/abs/2201.07520

work page arXiv 2022
[3]

Aghajanyan, L

A. Aghajanyan, L. Yu, A. Conneau, W. Hsu, K. Hambardzumyan, S. Zhang, S. Roller, N. Goyal, O. Levy, and L. Zettle- moyer. Scaling laws for generative mixed-modal language models. In A. Krause, E. Brunskill, K. Cho, B. Engel- hardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawa...

work page 2023
[4]

Agrawal, K

H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson. nocaps: novel ob- ject captioning at scale. InProceedings of the IEEE International Conference on Com- puter Vision, pages 8948–8957, 2019

work page 2019
[5]

Alabdulmohsin, X

I. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer. Getting vit in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023

work page 2023
[6]

Flamingo: a Visual Language Model for Few-Shot Learning

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Men- sch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O.Vinyals, A.Zisserman, andK.Simonyan. Flamingo: a visual language model f...

work page internal anchor Pith review arXiv 2022
[7]

R. Anil, S. Borgeaud, Y. Wu, J. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, S. Petrov, M. Johnson, I. Antonoglou, J. Schrit- twieser, A. Glaese, J. Chen, E. Pitler, T. P. Lillicrap, A. Lazaridou, O. Firat, J. Mol- loy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Do- herty, E. Col...

work page internal anchor Pith review Pith/arXiv arXiv
[9]

E. K. M. S. H. H. A. F. Aniruddha Kemb- havi, Mike Salvato. A diagram is worth a dozen images. In European Confer- ence on Computer Vision (ECCV), 2016. URL https://api.semanticscholar. org/CorpusID:2682274

work page 2016
[10]

Baechler, S

G. Baechler, S. Sunkara, M. Wang, F. Zubach, H. Mansoor, V. Etter, V. Căr- bune, J. Lin, J. Chen, and A. Sharma. Screenai: A vision-language model for ui and infographics understanding, 2024. URL https://arxiv.org/abs/2402. 04615

work page 2024
[11]

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. CoRR, abs/2308.12966, 2023. doi: 10.48550/ARXIV.2308.12966. URL https://doi.org/10.48550/arXiv.2308.12966
[12]

R. Bavishi, E. Elsen, C. Hawthorne, M. Nye, A. Odena, A. Somani, and S. Taşırlar. Introducing our multimodal models,
[13]

URL https://www.adept.ai/blog/fuyu-8b
[14]

L. Beyer, X. Zhai, and A. Kolesnikov. Big vision. https://github.com/google-research/big_vision, 2022
[15]

L. Beyer, P. Izmailov, A. Kolesnikov, M. Caron, S. Kornblith, X. Zhai, M. Minderer, M. Tschannen, I. Alabdulmohsin, and F. Pavetic. FlexiViT: One model for all patch sizes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14496–14506, 2023
[16]

L. Beyer, B. Wan, G. Madan, F. Pavetic, A. Steiner, A. Kolesnikov, A. S. Pinto, E. Bugliarello, X. Wang, Q. Yu, et al. A study of autoregressive decoders for multi-tasking in computer vision. arXiv preprint arXiv:2303.17376, 2023
[17]

A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, C. Jawahar, E. Valveny, and D. Karatzas. Scene text visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, Oct. 2019. doi: 10.1109/iccv.2019.00439. URL http://dx.doi.org/10.1109/ICCV.2019.00439
[18]

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax
[19]

E. Bugliarello, F. Liu, J. Pfeiffer, S. Reddy, D. Elliott, E. M. Ponti, and I. Vulić. IGLUE: A benchmark for transfer learning across modalities, tasks, and languages. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 2370–2392, Baltimore, MD, July 2022. PMLR. URL htt...
[20]

J. Cha, W. Kang, J. Mun, and B. Roh. Honeybee: Locality-enhanced projector for multimodal LLM. CoRR, abs/2312.06742,
[21]

doi: 10.48550/ARXIV.2312.06742. URL https://doi.org/10.48550/arXiv.2312.06742
[22]

S. Changpinyo, D. Kukliansy, I. Szpektor, X. Chen, N. Ding, and R. Soricut. All you may need for VQA are image captions. In M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1947–1963, ...
[23]

URL https://aclanthology.org/2022.naacl-main.142
[24]

D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In D. Lin, Y. Matsumoto, and R. Mihalcea, editors, The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 190–200. The Association for Co...
[25]

L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023
[26]

T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. E. Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022
[27]

X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, A. Kolesnikov, J. Puigcerver, N. Ding, K. Rong, H. Akbari, G. Mishra, L. Xue, A. Thapliyal, J. Bradbury, W. Kuo, M. Seyedhosseini, C. Jia, B. K. Ayan, C. Riquelme, A. Steiner, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. PaLI: A j...

[28]

X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz, S. Goodman, X. Wang, Y. Tay, S. Shakeri, M. Dehghani, D. Salz, M. Lucic, M. Tschannen, A. Nagrani, H. Hu, M. Joshi, B. Pang, C. Montgomery, P. Pietrzyk, M. Ritter, A. J. Piergiovanni, M. Minderer, F. Pavetic, A. Waters, G. Li, I. Alabdulmohsin, L. Beyer, J. Amelot, K. Lee, A. P. Ste...
[29]

X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J. Wu, P. Voigtlaender, B. Mustafa, S. Goodman, I. Alabdulmohsin, P. Padlewski, D. Salz, X. Xiong, D. Vlasic, F. Pavetic, K. Rong, T. Yu, D. Keysers, X. Zhai, and R. Soricut. PaLI-3 vision language models: Smaller, faster, stronger. CoRR, arXiv:2310.09199, 2023
[30]

Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023
[31]

J. Cho, J. Lei, H. Tan, and M. Bansal. Unifying vision-and-language tasks via text generation. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1931–1942. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/cho21a.html
[32]

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P....
[33]

M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, R. Jenatton, L. Beyer, M. Tschannen, A. Arnab, X. Wang, C. R. Ruiz, M. Minderer, J. Puigcerver, U. Evci, M. Kumar, S. van Steenkiste, G. F. Elsayed, A. Mahendran, F. Yu, A. Oliver, F. Huot, J. Bastings, M. Collier, A. A. Grits...
[34]

M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdulmohsin, et al. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36, 2024
[35]

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3D objects. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 13142–13153. IEEE, 2023. doi: 10.1109/CVPR52729.2023.0126...
[36]

K. Desai and J. Johnson. VirTex: Learning visual representations from textual annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11162–11173, 2021
[37]

J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapol...
[38]

H. Diao, Y. Cui, X. Li, Y. Wang, H. Lu, and X. Wang. Unveiling encoder-free vision-language models, 2024. URL https://arxiv.org/abs/2406.11832
[39]

X. Dong, P. Zhang, Y. Zang, Y. Cao, B. Wang, L. Ouyang, S. Zhang, H. Duan, W. Zhang, Y. Li, H. Yan, Y. Gao, Z. Chen, X. Zhang, W. Li, J. Li, W. Wang, K. Chen, C. He, X. Zhang, J. Dai, Y. Qiao, D. Lin, and J. Wang. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd, 2024. URL https://arxiv.org/ ...

[40]

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-E: An embodied multimodal language model. In A. Krause, E. Brunskill, K. Cho, B. Engelh...
[41]

URL https://proceedings.mlr.press/v202/driess23a.html
[42]

G. Geigle, R. Timofte, and G. Glavaš. African or european swallow? benchmarking large vision-language models for fine-grained object classification, 2024. URL https://arxiv.org/abs/2406.14496
[43]

Google Cloud. Introduction to Cloud TPU. https://cloud.google.com/tpu/docs/intro-to-tpu, 20xx. Accessed: 2024-07-04
[44]

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Computer Vision and Pattern Recognition (CVPR), 2017
[45]

D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617, 2018
[46]

M. He, Y. Liu, B. Wu, J. Yuan, Y. Wang, T. Huang, and B. Zhao. Efficient multimodal learning from data-centric perspective. CoRR, abs/2402.11530,
[47]

doi: 10.48550/ARXIV.2402.11530. URL https://doi.org/10.48550/arXiv.2402.11530
[48]

J. Hewitt. Initializing new word embeddings for pretrained language models. https://nlp.stanford.edu/~johnhew//vocab-expansion.html, 2021
[49]

C.-Y. Hsieh, J. Zhang, Z. Ma, A. Kembhavi, and R. Krishna. SugarCrepe: fixing hackable benchmarks for vision-language compositionality. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2024. Curran Associates Inc
[50]

T.-Y. Hsu, C. L. Giles, and T.-H. Huang. Scicap: Generating captions for scientific figures. arXiv preprint arXiv:2110.11624, 2021

[51]

S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, K. Aggarwal, Z. Chi, N. J. B. Bjorck, V. Chaudhary, S. Som, X. Song, and F. Wei. Language is not all you need: Aligning perception with language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural...
[52]

Y. Huang, Z. Meng, F. Liu, Y. Su, N. Collier, and Y. Lu. Sparkles: Unlocking chats across multiple images for multimodal instruction-following models,
[53]

URL https://openreview.net/forum?id=oq5EF8parZ
[54]

D. Hudson and C. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. Computer Vision and Pattern Recognition (CVPR), abs/1902.09506, 2019. doi: 10.48550/arXiv.1902.09506. URL https://doi.org/10.48550/arXiv.1902.09506
[55]

A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira. Perceiver: General perception with iterative attention. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings o...
[56]

URL http://proceedings.mlr.press/v139/jaegle21a.html
[57]

C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Pro...
[58]

URL http://proceedings.mlr.press/v139/jia21b.html
[59]

R. Kabra, L. Matthey, A. Lerchner, and N. J. Mitra. Leveraging VLM-based pipelines to annotate 3D objects. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR, 2024
[60]

I. Kajić, O. Wiles, I. Albuquerque, M. Bauer, S. Wang, J. Pont-Tuset, and A. Nematzadeh. Evaluating numerical reasoning in text-to-image models, 2024
[61]

S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic VLMs: Investigating the design space of visually-conditioned language models,
[62]

URL https://arxiv.org/abs/2402.07865
[63]

S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In A. Moschitti, B. Pang, and W. Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. d...
[64]

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollar, and R. Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, October 2023
[65]

A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby. Big transfer (BiT): General visual representation learning. In European Conference on Computer Vision (ECCV), 2020
[66]

A. Kolesnikov, A. S. Pinto, L. Beyer, X. Zhai, J. Harmsen, and N. Houlsby. UViM: A unified modeling approach for vision with learned guiding codes. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIP...
[67]

R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017
[68]

T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In E. Blanco and W. Lu, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, Nov. 2018. Association for Computational ...
[69]

H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when building vision-language models? CoRR, abs/2405.02246, 2024. doi: 10.48550/ARXIV.2405.02246. URL https://doi.org/10.48550/arXiv.2405.02246
[70]

H. Laurençon, L. Saulnier, L. Tronchon, S. Bekman, A. Singh, A. Lozhkov, T. Wang, S. Karamcheti, A. M. Rush, D. Kiela, M. Cord, and V. Sanh. OBELICS: An open web-scale filtered dataset of interleaved image-text documents, 2023. URL https://arxiv.org/abs/2306.16527
[71]

J. Li, D. Li, C. Xiong, and S. C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proc...
[72]

URL https://proceedings.mlr.press/v162/li22n.html
[73]

J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of...
[74]

URL https://proceedings.mlr.press/v202/li23q.html
[75]

Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. Widget captioning: Generating natural language description for mobile user interface elements. In Conference on Empirical Methods in Natural Language Processing, 2020
[76]

Y. Li, H. Mao, R. Girshick, and K. He. Exploring plain vision transformer backbones for object detection, 2022. URL https://arxiv.org/abs/2203.16527
[77]

Y. Li, Y. Zhang, C. Wang, Z. Zhong, Y. Chen, R. Chu, S. Liu, and J. Jia. Mini-Gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024
[78]

B. Lin, Z. Tang, Y. Ye, J. Cui, B. Zhu, P. Jin, J. Huang, J. Zhang, Y. Pang, M. Ning, and L. Yuan. MoE-LLaVA: Mixture of experts for large vision-language models,
[79]

URL https://arxiv.org/abs/2401.15947
[80]

T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312
[81]

F. Liu, E. Bugliarello, E. M. Ponti, S. Reddy, N. Collier, and D. Elliott. Visually grounded reasoning across languages and cultures. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, Online and Punta Cana, Dominican Republic, Nov. ...

Showing first 80 references.