pith. machine review for the scientific record.

arXiv: 2407.07726 · v2 · submitted 2024-07-10 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 3 theorem links

· Lean Theorem

PaliGemma: A versatile 3B VLM for transfer

Pith reviewed 2026-05-11 13:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL cs.LG
keywords vision-language model, transfer learning, multimodal AI, open weights, computer vision, image captioning, visual question answering

The pith

PaliGemma fuses a SigLIP vision encoder with the Gemma-2B language model to produce a 3B open VLM that transfers effectively to nearly 40 tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PaliGemma as an open vision-language model built from existing components and trained to serve as a versatile base. It claims this model delivers strong results across standard benchmarks plus specialized domains such as remote sensing and segmentation. A reader would care because a single pretrained model that adapts readily could reduce the cost and data needed for new multimodal applications. The work emphasizes openness so others can build upon or fine-tune the same weights rather than starting from scratch.

Core claim

PaliGemma is a 3-billion-parameter vision-language model that combines the SigLIP-So400m vision encoder with the Gemma-2B language model. When trained with a recipe intended to create a broadly knowledgeable base, it achieves strong performance on almost 40 diverse open-world tasks without requiring task-specific architectures.

What carries the argument

The integration of the SigLIP vision encoder and Gemma language model under a training procedure that aims to produce transferable multimodal capabilities.

Load-bearing premise

The training procedure applied to these two components produces a model whose effectiveness on the reported tasks holds without undisclosed choices in task selection or evaluation details.

What would settle it

An independent training run of the identical architecture and data recipe that measures accuracy on the same set of tasks and finds performance that deviates substantially from the published numbers.
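
One way to make this test concrete is a per-task comparison against a fixed tolerance. The sketch below is illustrative only: the function name, the task names, the scores, and the 2-point tolerance are hypothetical placeholders, not values from the paper.

```python
def deviating_tasks(published, reproduced, tolerance):
    """Return {task: reproduced - published} for tasks that differ by more than tolerance."""
    deltas = {}
    for task, reference in published.items():
        if task not in reproduced:
            continue  # skip tasks the independent run did not evaluate
        delta = reproduced[task] - reference
        if abs(delta) > tolerance:
            deltas[task] = delta
    return deltas


# Placeholder scores (hypothetical, for illustration only).
published = {"task_a": 80.0, "task_b": 70.0, "task_c": 60.0}
reproduced = {"task_a": 79.5, "task_b": 64.0, "task_c": 60.2}

flagged = deviating_tasks(published, reproduced, tolerance=2.0)
print(flagged)  # {'task_b': -6.0}
```

What counts as a "substantial" deviation is a judgment call; a tolerance tied to seed-to-seed variance would be a more defensible choice than a fixed constant, if such variance were reported.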

Original abstract

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PaliGemma, an open 3B-parameter Vision-Language Model combining the SigLIP-So400m vision encoder with the Gemma-2B language model. It is positioned as a versatile base model trained for broad knowledge and effective transfer, with claims of strong performance across nearly 40 diverse tasks that include standard VLM benchmarks as well as specialized domains such as remote-sensing and segmentation.

Significance. If the transfer-effectiveness claims hold under rigorous evaluation, the release of this open VLM could provide a practical starting point for multimodal research, reducing the need for task-specific pretraining from scratch and extending utility to non-standard domains like remote sensing.

major comments (3)
  1. [Abstract] Abstract: the central claim of 'strong performance' and 'versatile' transfer on ~40 tasks is asserted without any quantitative results, baselines, error bars, or even a high-level summary of metrics; this absence is load-bearing because the generalization assertion cannot be assessed from the given text.
  2. [Evaluation] Evaluation section (implied by the task count): no description is provided of the data mixture, training objective, per-task prompt templates, or metric implementations, leaving the weakest assumption (that the SigLIP+Gemma recipe yields generalizable transfer without post-hoc selection) untestable.
  3. [Methods] Methods: the precise recipe for combining SigLIP-So400m and Gemma-2B (including any continued pretraining or alignment stages) is unspecified, which directly affects reproducibility and the claim that this particular combination is broadly effective.
minor comments (2)
  1. [Abstract] The model is described as '3B' while the components are listed as So400m + 2B; a brief clarification of total parameter count or whether the figure is approximate would avoid confusion.
  2. A summary table listing the ~40 tasks with primary metrics and comparisons to prior open VLMs would substantially improve readability and allow readers to verify the breadth of the evaluation.
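
The parameter-count question in minor comment 1 can be illustrated with a rough tally. This is a minimal sketch that assumes the nominal sizes suggested by the component names (about 0.4B for So400m and 2B for Gemma-2B). Exact counts are not given in the text above, so both figures are placeholders.

```python
# Nominal sizes read off the component names (assumptions, not exact counts).
NOMINAL_PARAMS = {
    "SigLIP-So400m": 0.4e9,
    "Gemma-2B": 2.0e9,
}


def total_billions(params):
    """Sum per-component parameter counts and express the total in billions."""
    return sum(params.values()) / 1e9


nominal_total = total_billions(NOMINAL_PARAMS)
print(f"nominal total: {nominal_total:.1f}B")               # nominal total: 2.4B
print(f"gap to the '3B' label: {3.0 - nominal_total:.1f}B")  # gap to the '3B' label: 0.6B
```

Under these nominal figures the sum falls short of 3B, which is the gap the referee asks the authors to explain. The name-based sizes may understate the true counts, for example if embedding parameters are excluded from a model's advertised size.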

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below, providing clarifications from the manuscript and indicating revisions that will strengthen the presentation of our claims, methods, and evaluations.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of 'strong performance' and 'versatile' transfer on ~40 tasks is asserted without any quantitative results, baselines, error bars, or even a high-level summary of metrics; this absence is load-bearing because the generalization assertion cannot be assessed from the given text.

    Authors: We agree that the abstract would benefit from a concise quantitative anchor to make the 'strong performance' and 'versatile' claims more immediately assessable. The full manuscript contains extensive tables and figures with per-task metrics, baselines, and comparisons, but these are not summarized at the abstract level. In the revision we will add one or two sentences providing a high-level overview (e.g., number of tasks where PaliGemma matches or exceeds prior open models, reference to the main result tables) while preserving the abstract's brevity. revision: yes

  2. Referee: [Evaluation] Evaluation section (implied by the task count): no description is provided of the data mixture, training objective, per-task prompt templates, or metric implementations, leaving the weakest assumption (that the SigLIP+Gemma recipe yields generalizable transfer without post-hoc selection) untestable.

    Authors: The manuscript does contain a training and evaluation section that specifies the overall data mixture, the multimodal training objective, and the evaluation protocol. However, we acknowledge that the level of detail on per-task prompt templates and exact metric implementations could be more explicit to allow readers to verify the absence of post-hoc selection. We will expand this section with additional specifics on the data composition, objective formulation, representative prompt templates, and metric definitions for the ~40 tasks. revision: yes

  3. Referee: [Methods] Methods: the precise recipe for combining SigLIP-So400m and Gemma-2B (including any continued pretraining or alignment stages) is unspecified, which directly affects reproducibility and the claim that this particular combination is broadly effective.

    Authors: The methods section describes the architectural integration of the SigLIP-So400m encoder with the Gemma-2B decoder, the joint training procedure, and the stages used to produce the final model. To improve reproducibility we will add more granular details on the exact combination mechanism, any continued pretraining or alignment phases, hyper-parameters, and initialization choices. This will make the recipe fully specified while preserving the core claim that the SigLIP+Gemma pairing enables broad transfer. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model release with no derivation chain

full rationale

The paper presents PaliGemma as an open VLM combining the SigLIP-So400m vision encoder and Gemma-2B language model, trained for versatile transfer and evaluated on nearly 40 tasks. No equations, first-principles derivations, predictions, or fitted parameters are introduced that could reduce to inputs by construction. Claims rest entirely on external benchmark results rather than internal self-definitions or self-citation chains. This is a standard empirical model release whose central assertions are externally falsifiable and independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the model is assembled from existing published components (SigLIP, Gemma) whose internal details are treated as given.

pith-pipeline@v0.9.0 · 5540 in / 1070 out tokens · 53866 ms · 2026-05-11T13:03:47.236577+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • HierarchyEmergence hierarchy_emergence_forces_phi: relation to the paper passage is unclear

    PaliGemma consists of three components: An image encoder... A decoder-only language model... A linear layer projecting SigLIP’s output tokens into the same dimensions as Gemma-2B’s vocab tokens
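
    The third component in the quoted passage, the linear projection, can be sketched in a few lines. This is a minimal NumPy illustration and not the authors' implementation. The token count and the two widths are assumptions chosen for the example, the weights are random, and placing the projected image tokens before the text tokens is likewise an assumption that the passage does not state.

```python
import numpy as np

# Illustrative sizes (assumptions, not stated in the quoted passage).
NUM_IMAGE_TOKENS = 256  # e.g. a 16x16 grid of image patches
VISION_WIDTH = 1152     # width of each vision-encoder output token
TEXT_WIDTH = 2048       # width of the language model's token embeddings


def project_image_tokens(image_tokens, weight, bias):
    """Linearly map vision tokens into the language model's embedding width."""
    return image_tokens @ weight + bias


rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(NUM_IMAGE_TOKENS, VISION_WIDTH))
weight = rng.normal(size=(VISION_WIDTH, TEXT_WIDTH)) * 0.02  # random stand-in weights
bias = np.zeros(TEXT_WIDTH)

projected = project_image_tokens(image_tokens, weight, bias)

# Hypothetical prompt of 8 text tokens that are already embedded.
text_embeddings = rng.normal(size=(8, TEXT_WIDTH))

# Assumed layout: image tokens first, then text tokens, as one decoder input.
decoder_input = np.concatenate([projected, text_embeddings], axis=0)
print(decoder_input.shape)  # (264, 2048)
```

    Once both modalities share a width, the decoder can treat image tokens and text tokens as a single sequence, which is why a single linear layer suffices as the connector in this sketch.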

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 56 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 8.0

    RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

  2. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  3. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  4. RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

    cs.RO 2026-05 unverdicted novelty 7.0

    RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

  5. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  6. Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

  7. CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.

  8. SketchVLM: Vision language models can annotate images to explain thoughts and guide users

    cs.CV 2026-04 unverdicted novelty 7.0

    SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

  9. Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

    cs.CV 2026-04 conditional novelty 7.0

    Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

  10. DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.

  11. MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

    cs.LG 2026-04 unverdicted novelty 7.0

    MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...

  12. Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

    cs.RO 2026-03 conditional novelty 7.0

    GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

  13. AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

  14. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  15. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  16. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  17. Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

    cs.CV 2026-05 unverdicted novelty 6.0

    Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, yielding 8.7x lower latency with 54-72% performance retention on ScanNet and ...

  18. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  19. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  20. AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.

  21. Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.

  22. Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Interventional attribution via ISS and NMR diagnoses causal misalignment in VLA policies and predicts their generalization performance across manipulation tasks.

  23. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  24. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  25. MotuBrain: An Advanced World Action Model for Robot Control

    cs.RO 2026-04 unverdicted novelty 6.0

    MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...

  26. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  27. Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...

  28. ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  29. AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...

  30. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  31. TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.

  32. AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

    cs.RO 2026-04 unverdicted novelty 6.0

    AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.

  33. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboLab is a photorealistic simulation benchmark with 120 tasks and perturbation analysis to evaluate true generalization and robustness of robotic foundation models.

  34. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  35. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  36. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  37. UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

    cs.CV 2026-04 conditional novelty 6.0

    UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV datase...

  38. mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    cs.RO 2025-12 unverdicted novelty 6.0

    mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.

  39. Perception Encoder: The best visual embeddings are not at the output of the network

    cs.CV 2025-04 unverdicted novelty 6.0

    Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...

  40. FAST: Efficient Action Tokenization for Vision-Language-Action Models

    cs.RO 2025-01 unverdicted novelty 6.0

    FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...

  41. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  42. Exploring and Exploiting Stability in Latent Flow Matching

    cs.LG 2026-05 unverdicted novelty 5.0

    Latent Flow Matching models exhibit inherent stability to data reduction and model shrinkage due to the flow matching objective, enabling reduced-dataset training and two-stage inference with over 2x speedup while pre...

  43. Let ViT Speak: Generative Language-Image Pre-training

    cs.CV 2026-05 unverdicted novelty 5.0

    GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

  44. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  45. The Amazing Stability of Flow Matching

    cs.CV 2026-04 unverdicted novelty 5.0

    Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.

  46. AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    AEGIS uses a pre-computed Gaussian anchor and layer-wise Gram-Schmidt orthogonal projections to isolate destructive gradients during VLA fine-tuning, preserving VQA performance without co-training or replay.

  47. Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.

  48. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  49. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  50. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  51. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

  52. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  53. PLaMo 2.1-VL Technical Report

    cs.CV 2026-04 unverdicted novelty 4.0

    PLaMo 2.1-VL reports 61.5 ROUGE-L on JA-VG-VQA-500, 85.2% on Japanese Ref-L4, 53.9% zero-shot factory accuracy, and raises anomaly detection F1 from 39.7 to 64.9 after fine-tuning.

  54. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  55. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

  56. PaliGemma 2: A Family of Versatile VLMs for Transfer

    cs.CV 2024-12 unverdicted novelty 4.0

    PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...

Reference graph

Works this paper leans on

167 extracted references · 167 canonical work pages · cited by 50 Pith papers · 10 internal anchors

  21. [22]

    Changpinyo, D

    S. Changpinyo, D. Kukliansy, I. Szpektor, X. Chen, N. Ding, and R. Soricut. All you may need for VQA are image cap- tions. In M. Carpuat, M.-C. de Marn- effe, and I. V. Meza Ruiz, editors, Pro- ceedings of the 2022 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Hu- man Language Technologies, pages 1947– 1963, ...

  22. [23]

    org/2022.naacl-main.142

    URL https://aclanthology. org/2022.naacl-main.142

  23. [24]

    D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase eval- uation. In D. Lin, Y. Matsumoto, and R. Mihalcea, editors, The 49th Annual Meeting of the Association for Computa- tional Linguistics: Human Language Tech- nologies, Proceedings of the Conference, 19- 24 June, 2011, Portland, Oregon, USA, pages 190–200. The Association for Co...

  24. [25]

    L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. Sharegpt4v: Improving large multi-modal modelswithbettercaptions. arXivpreprint arXiv:2311.12793, 2023

  25. [26]

    T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. E. Hinton. Pix2seq: A language mod- eling framework for object detection. In ICLR, 2022

  26. [27]

    X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, A. Kolesnikov, J. Puigcerver, N. Ding, K. Rong, H. Akbari, G. Mishra, L. Xue, A. Thapliyal, J. Bradbury, W. Kuo, M. Seyedhosseini, C. Jia, B. K. Ayan, C. Riquelme, A. Steiner, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. PaLI: A j...

  27. [28]

    X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz,S.Goodman,X.Wang,Y.Tay,S.Shak- eri, M. Dehghani, D. Salz, M. Lucic, M.Tschannen, A.Nagrani, H.Hu, M.Joshi, B. Pang, C. Montgomery, P. Pietrzyk, M. Ritter, A. J. Piergiovanni, M. Minderer, F. Pavetic, A. Waters, G. Li, I. Alabdul- mohsin, L. Beyer, J. Amelot, K. Lee, A. P. Ste...

  28. [29]

    X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J.Wu, P.Voigtlaender, B.Mustafa, S.Good- man, I. Alabdulmohsin, P. Padlewski, D. Salz, X. Xiong, D. Vlasic, F. Pavetic, K. Rong, T. Yu, D. Keysers, X. Zhai, and R. Soricut. PaLI-3 vision language mod- els: Smaller, faster, stronger. CoRR, arXiv:2310.09199, 2023

  29. [30]

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai. Internvl: Scaling up vision foun- dation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023

  30. [31]

    J. Cho, J. Lei, H. Tan, and M. Bansal. Uni- fying vision-and-language tasks via text generation. In M. Meila and T. Zhang, ed- itors, Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Re- search, pages 1931–1942. PMLR, 18–24 Jul 2021. URLhttps://proceedings. mlr.press/v139/cho21a.html

  31. [32]

    Chowdhery, S

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sut- ton, S. Gehrmann, P. Schuh, K. Shi, 16 PaliGemma: A versatile 3B VLM for transfer S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prab- hakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P....

  32. [33]

    Dehghani, J

    M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Al- abdulmohsin, R. Jenatton, L. Beyer, M. Tschannen, A. Arnab, X. Wang, C. R. Ruiz, M. Minderer, J. Puigcerver, U. Evci, M. Kumar, S. van Steenkiste, G. F. El- sayed, A. Mahendran, F. Yu, A. Oliver, F. Huot, J. Bastings, M. Collier, A. A. Grits...

  33. [34]

    Dehghani, B

    M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdul- mohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and res- olution. Advances in Neural Information Processing Systems, 36, 2024

  34. [35]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR)

    M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of anno- tated 3d objects. InIEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 13142–13153. IEEE, 2023. doi: 10.1109/CVPR52729. 2023.0126...

  35. [36]

    Desai and J

    K. Desai and J. Johnson. Virtex: Learning visual representations from textual anno- tations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11162–11173, 2021

  36. [37]

    BERT: Pre- training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for lan- guage understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Pro- ceedings of the 2019 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Hu- man Language Technologies, NAACL-HLT 2019, Minneapol...

  37. [38]

    H. Diao, Y. Cui, X. Li, Y. Wang, H. Lu, and X. Wang. Unveiling encoder-free vision- language models, 2024. URLhttps:// arxiv.org/abs/2406.11832

  38. [39]

    X. Dong, P. Zhang, Y. Zang, Y. Cao, B. Wang, L. Ouyang, S. Zhang, H. Duan, W. Zhang, Y. Li, H. Yan, Y. Gao, Z. Chen, X. Zhang, W. Li, J. Li, W. Wang, K. Chen, C. He, X. Zhang, J. Dai, Y. Qiao, D. Lin, and J. Wang. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd, 2024. URL https://arxiv.org/ ...

  39. [40]

    Driess, F

    D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A.Chowdhery, B.Ichter, A.Wahid, J.Tomp- son, Q. Vuong, T. Yu, W. Huang, Y. Chebo- tar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Flo- rence. PaLM-E: An embodied multimodal language model. In A. Krause, E. Brun- skill, K. Cho, B. Engelh...

  40. [41]

    mlr.press/v202/driess23a.html

    URL https://proceedings. mlr.press/v202/driess23a.html

  41. [42]

    Geigle, R

    G. Geigle, R. Timofte, and G. Glavaš. African or european swallow? bench- marking large vision-language models for fine-grained object classification, 2024. URL https://arxiv.org/abs/2406. 14496

  42. [43]

    Introduction to Cloud TPU

    Google Cloud. Introduction to Cloud TPU. https://cloud.google.com/ tpu/docs/intro-to-tpu, 20xx. Ac- cessed: 2024-07-04

  43. [44]

    Goyal, T

    Y. Goyal, T. Khot, D. Summers-Stay, D. Ba- tra, and D. Parikh. Making the V in VQA matter: Elevating the role of image under- standing in Visual Question Answering. In Computer Vision and Pattern Recognition (CVPR), 2017

  44. [45]

    Grauman, J

    D.Gurari, Q.Li, A.J.Stangl, A.Guo, C.Lin, K. Grauman, J. Luo, and J. P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InProceed- ings of the IEEE conference on computer vi- sion and pattern recognition, pages 3608– 3617, 2018

  45. [46]

    M. He, Y. Liu, B. Wu, J. Yuan, Y. Wang, T. Huang, and B. Zhao. Efficient multimodal learning from data-centric perspective. CoRR, abs/2402.11530,

  46. [47]

    doi: 10.48550/ARXIV.2402. 11530. URL https://doi.org/10. 48550/arXiv.2402.11530

  47. [48]

    J. Hewitt. Initializing new word em- beddings for pretrained language models. https:/nlp.stanford.edu/ ~johnhew//vocab-expansion.html, 2021

  48. [49]

    Hsieh, J

    C.-Y. Hsieh, J. Zhang, Z. Ma, A. Kembhavi, and R. Krishna. Sugarcrepe: fixing hack- able benchmarks for vision-language com- positionality. In Proceedings of the 37th International Conference on Neural Infor- mation Processing Systems, NIPS ’23, Red Hook, NY, USA, 2024. Curran Associates Inc

  49. [50]

    T.-Y. Hsu, C. L. Giles, and T.-H. Huang. Scicap: Generating captions for scientific figures. arXiv preprint arXiv:2110.11624, 2021

  50. [51]

    Huang, L

    S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, K. Aggar- wal, Z. Chi, N. J. B. Bjorck, V. Chaudhary, S. Som, X. Song, and F. Wei. Language is not all you need: Aligning perception with language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural...

  51. [52]

    Huang, Z

    Y. Huang, Z. Meng, F. Liu, Y. Su, N. Col- lier, and Y. Lu. Sparkles: Unlock- ing chats across multiple images for multimodal instruction-following models,

  52. [53]

    URLhttps://openreview.net/ forum?id=oq5EF8parZ

  53. [54]

    Hudson and Christopher D

    D. Hudson and C. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. Computer Vision and Pattern Recogni- tion (CVPR) , abs/1902.09506, 2019. doi: 10.48550/arXiv.1902.09506. URL https://doi.org/10.48550/arXiv. 1902.09506

  54. [55]

    Jaegle, F

    A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira. Perceiver: 18 PaliGemma: A versatile 3B VLM for transfer General perception with iterative atten- tion. In M. Meila and T. Zhang, edi- tors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, vol- ume 139 ofProceedings o...

  55. [56]

    press/v139/jaegle21a.html

    URL http://proceedings.mlr. press/v139/jaegle21a.html

  56. [57]

    C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learn- ing with noisy text supervision. In M. Meila and T. Zhang, editors, Pro- ceedings of the 38th International Confer- ence on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Pro...

  57. [58]

    press/v139/jia21b.html

    URL http://proceedings.mlr. press/v139/jia21b.html

  58. [59]

    Kabra, L

    R. Kabra, L. Matthey, A. Lerchner, and N. J. Mitra. Leveraging vlm-based pipelines to annotate 3d objects. InProceedings of the 41st International Conference on Machine Learning, volume235of ProceedingsofMa- chine Learning Research. PMLR, 2024

  59. [60]

    Kajić, O

    I. Kajić, O. Wiles, I. Albuquerque, M. Bauer, S. Wang, J. Pont-Tuset, and A. Ne- matzadeh. Evaluating numerical reason- ing in text-to-image models, 2024

  60. [61]

    Karamcheti, S

    S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Pris- matic vlms: Investigating the design space of visually-conditioned language models,

  61. [62]

    URL https://arxiv.org/abs/ 2402.07865

  62. [63]

    R efer I t G ame: Referring to objects in photographs of natural scenes

    S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In A. Moschitti, B. Pang, and W. Daele- mans, editors, Proceedings of the 2014 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP), pages 787–798, Doha, Qatar, Oct. 2014. Associ- ation for Computational Linguistics. d...

  63. [64]

    Kirillov, E

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. White- head, A. C. Berg, W.-Y. Lo, P. Dollar, and R. Girshick. Segment anything. In Pro- ceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), pages 4015–4026, October 2023

  64. [65]

    Kolesnikov, L

    A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby. Big transfer (BiT): General vi- sual representation learning. InEuropean Conference on Computer Vision (ECCV), 2020

  65. [66]

    Kolesnikov, A

    A. Kolesnikov, A. S. Pinto, L. Beyer, X. Zhai, J. Harmsen, and N. Houlsby. Uvim: A unified modeling approach for vision with learned guiding codes. In S. Koyejo, S. Mo- hamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural In- formation Processing Systems 35: Annual Conference on Neural Information Process- ing Systems 2022, NeurIP...

  66. [67]

    Krishna, K

    R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE in- ternational conference on computer vision, pages 706–715, 2017

  67. [68]

    SentencePiece:

    T. Kudo and J. Richardson. SentencePiece: A simple and language independent sub- word tokenizer and detokenizer for neural text processing. In E. Blanco and W. Lu, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, Nov.2018.Asso- ciation for Computational ...

  68. [69]

    Laurençon, L

    H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when build- ing vision-language models? CoRR, abs/2405.02246, 2024. doi: 10.48550/ 19 PaliGemma: A versatile 3B VLM for transfer ARXIV.2405.02246. URL https://doi. org/10.48550/arXiv.2405.02246

  69. [70]

    Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh

    H. Laurençon, L. Saulnier, L. Tronchon, S. Bekman, A. Singh, A. Lozhkov, T. Wang, S. Karamcheti, A. M. Rush, D. Kiela, M. Cord, and V. Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023. URLhttps: //arxiv.org/abs/2306.16527

  70. [71]

    J. Li, D. Li, C. Xiong, and S. C. H. Hoi. BLIP: bootstrapping language- image pre-training for unified vision- language understanding and generation. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, editors, International Conference on Ma- chine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA , vol- ume 162 ofProc...

  71. [72]

    mlr.press/v162/li22n.html

    URL https://proceedings. mlr.press/v162/li22n.html

  72. [73]

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language- image pre-training with frozen image en- coders and large language models. In A. Krause, E. Brunskill, K. Cho, B. En- gelhardt, S. Sabato, and J. Scarlett, ed- itors, International Conference on Ma- chine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , volume 202 of...

  73. [74]

    mlr.press/v202/li23q.html

    URL https://proceedings. mlr.press/v202/li23q.html

  74. [75]

    Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. Widget captioning: Generat- ing natural language description for mo- bileuser interface elements. InConference on Empirical Methods in Natural Language Processing, 2020

  75. [76]

    Y. Li, H. Mao, R. Girshick, and K. He. Exploring plain vision transformer backbones for object detection, 2022. URL https: //arxiv.org/abs/2203.16527

  76. [77]

    Y.Li,Y.Zhang,C.Wang,Z.Zhong,Y.Chen, R. Chu, S. Liu, and J. Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024

  77. [78]

    B. Lin, Z. Tang, Y. Ye, J. Cui, B. Zhu, P. Jin, J. Huang, J. Zhang, Y. Pang, M. Ning, and L. Yuan. Moe-llava: Mixture of ex- perts for large vision-language models,

  78. [79]

    URL https://arxiv.org/abs/ 2401.15947

  79. [80]

    T. Lin, M. Maire, S. J. Belongie, L. D. Bour- dev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Doll’a r, and C. L. Zitnick. Microsoft COCO: common objects in con- text. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312

  80. [81]

    F. Liu, E. Bugliarello, E. M. Ponti, S. Reddy, N. Collier, and D. Elliott. Visually grounded reasoning across languages and cultures. In M.-F. Moens, X. Huang, L. Spe- cia, and S. W.-t. Yih, editors, Proceed- ings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, Online and Punta Cana, Dominican Republic, Nov. ...

Showing first 80 references.