PaliGemma: A versatile 3B VLM for transfer
Recognition: 3 theorem links · Lean Theorem
Pith reviewed 2026-05-11 13:03 UTC · model grok-4.3
The pith
PaliGemma fuses a SigLIP vision encoder with the Gemma-2B language model to produce a 3B open VLM that transfers effectively to nearly 40 tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaliGemma is a 3-billion-parameter vision-language model that combines the SigLIP-So400m vision encoder with the Gemma-2B language model. When trained with a recipe intended to create a broadly knowledgeable base, it achieves strong performance on almost 40 diverse open-world tasks without requiring task-specific architectures.
What carries the argument
The integration of the SigLIP vision encoder and Gemma language model under a training procedure that aims to produce transferable multimodal capabilities.
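Mechanically, the glue is thin: the paper describes a single linear layer that projects SigLIP's output tokens into Gemma-2B's embedding space, after which image and text tokens form one multimodal sequence for the decoder. A minimal sketch of that pattern follows; the dimensions and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sizes only; the real model uses SigLIP-So400m and Gemma-2B widths.
NUM_IMAGE_TOKENS = 256   # assumed: a 16x16 grid of patch tokens at 224px
VISION_DIM = 1152        # assumed SigLIP-So400m output width
LM_DIM = 2048            # assumed Gemma-2B embedding width

def fuse(image_tokens: np.ndarray, text_embeddings: np.ndarray,
         projection: np.ndarray) -> np.ndarray:
    """Project image tokens into the LM embedding space and prepend them
    to the embedded text prompt, yielding one sequence for the decoder."""
    projected = image_tokens @ projection           # (NUM_IMAGE_TOKENS, LM_DIM)
    return np.concatenate([projected, text_embeddings], axis=0)

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(NUM_IMAGE_TOKENS, VISION_DIM))  # stand-in for SigLIP output
text_embeddings = rng.normal(size=(16, LM_DIM))                 # stand-in for embedded prompt
projection = 0.02 * rng.normal(size=(VISION_DIM, LM_DIM))       # the single linear adapter

print(fuse(image_tokens, text_embeddings, projection).shape)    # (272, 2048)
```

If the claim of broad transfer holds, most of the heavy lifting is done by the two pretrained components and the training recipe rather than by this adapter, which is why the recipe details matter so much in the editorial analysis below.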
Load-bearing premise
The training procedure applied to these two components produces a model whose reported effectiveness on these tasks does not hinge on undisclosed choices in task selection or evaluation details.
What would settle it
An independent training run of the identical architecture and data recipe, measuring accuracy on the same set of tasks: close agreement with the published numbers would confirm the claim, while substantial deviation would undermine it.
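Operationally, "substantial deviation" needs a threshold. A minimal sketch of how such a replication check could be scored; the task names, scores, and 2% tolerance are placeholders, not values from the paper.

```python
def replication_verdict(published: dict[str, float],
                        reproduced: dict[str, float],
                        rel_tol: float = 0.02) -> dict[str, bool]:
    """Flag, per task, whether a reproduced score falls within a relative
    tolerance of the published number (tolerance chosen for illustration)."""
    return {task: abs(reproduced[task] - ref) <= rel_tol * abs(ref)
            for task, ref in published.items()}

# Placeholder numbers only, to show the shape of the comparison.
published = {"task_a": 85.0, "task_b": 70.0}
reproduced = {"task_a": 84.2, "task_b": 61.0}
print(replication_verdict(published, reproduced))
# {'task_a': True, 'task_b': False}  -> task_b deviates substantially
```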
Original abstract
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaliGemma, an open 3B-parameter Vision-Language Model combining the SigLIP-So400m vision encoder with the Gemma-2B language model. It is positioned as a versatile base model trained for broad knowledge and effective transfer, with claims of strong performance across nearly 40 diverse tasks that include standard VLM benchmarks as well as specialized domains such as remote-sensing and segmentation.
Significance. If the transfer-effectiveness claims hold under rigorous evaluation, the release of this open VLM could provide a practical starting point for multimodal research, reducing the need for task-specific pretraining from scratch and extending utility to non-standard domains like remote sensing.
major comments (3)
- [Abstract] The central claim of 'strong performance' and 'versatile' transfer on ~40 tasks is asserted without any quantitative results, baselines, error bars, or even a high-level summary of metrics; this absence is load-bearing because the generalization assertion cannot be assessed from the given text.
- [Evaluation] Evaluation section (implied by the task count): no description is provided of the data mixture, training objective, per-task prompt templates, or metric implementations, leaving the weakest assumption (that the SigLIP+Gemma recipe yields generalizable transfer without post-hoc selection) untestable.
- [Methods] The precise recipe for combining SigLIP-So400m and Gemma-2B (including any continued pretraining or alignment stages) is unspecified, which directly affects reproducibility and the claim that this particular combination is broadly effective.
minor comments (2)
- [Abstract] The model is described as '3B' while the components are listed as So400m + 2B; a brief clarification of total parameter count or whether the figure is approximate would avoid confusion.
- A summary table listing the ~40 tasks with primary metrics and comparisons to prior open VLMs would substantially improve readability and allow readers to verify the breadth of the evaluation.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address each major comment below, providing clarifications from the manuscript and indicating revisions that will strengthen the presentation of our claims, methods, and evaluations.
Point-by-point responses
-
Referee: [Abstract] The central claim of 'strong performance' and 'versatile' transfer on ~40 tasks is asserted without any quantitative results, baselines, error bars, or even a high-level summary of metrics; this absence is load-bearing because the generalization assertion cannot be assessed from the given text.
Authors: We agree that the abstract would benefit from a concise quantitative anchor to make the 'strong performance' and 'versatile' claims more immediately assessable. The full manuscript contains extensive tables and figures with per-task metrics, baselines, and comparisons, but these are not summarized at the abstract level. In the revision we will add one or two sentences providing a high-level overview (e.g., number of tasks where PaliGemma matches or exceeds prior open models, reference to the main result tables) while preserving the abstract's brevity. revision: yes
-
Referee: [Evaluation] Evaluation section (implied by the task count): no description is provided of the data mixture, training objective, per-task prompt templates, or metric implementations, leaving the weakest assumption (that the SigLIP+Gemma recipe yields generalizable transfer without post-hoc selection) untestable.
Authors: The manuscript does contain a training and evaluation section that specifies the overall data mixture, the multimodal training objective, and the evaluation protocol. However, we acknowledge that the level of detail on per-task prompt templates and exact metric implementations could be more explicit to allow readers to verify the absence of post-hoc selection. We will expand this section with additional specifics on the data composition, objective formulation, representative prompt templates, and metric definitions for the ~40 tasks. revision: yes
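For readers trying to picture what "per-task prompt templates" means here: models in this family are typically steered by short task-prefix strings, and the ~40 transfer tasks differ mainly in that prefix and in how the target text is rendered. The prefixes below are assumptions for illustration, not a verbatim list from the paper.

```python
# Hypothetical task-prefix templates in the PaliGemma style; the paper's own
# prefixes may differ in wording and coverage.
PROMPT_TEMPLATES = {
    "captioning":   "caption {lang}\n",
    "vqa":          "answer {lang} {question}\n",
    "detection":    "detect {classes}\n",
    "segmentation": "segment {classes}\n",
}

def build_prompt(task: str, **fields: str) -> str:
    """Render the text prefix fed to the model alongside the image."""
    return PROMPT_TEMPLATES[task].format(**fields)

print(build_prompt("vqa", lang="en", question="How many boats are visible?"))
# -> "answer en How many boats are visible?\n"
```

Publishing the exact template and metric implementation per task is what would let readers rule out post-hoc selection, which is the referee's underlying concern.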
-
Referee: [Methods] The precise recipe for combining SigLIP-So400m and Gemma-2B (including any continued pretraining or alignment stages) is unspecified, which directly affects reproducibility and the claim that this particular combination is broadly effective.
Authors: The methods section describes the architectural integration of the SigLIP-So400m encoder with the Gemma-2B decoder, the joint training procedure, and the stages used to produce the final model. To improve reproducibility we will add more granular details on the exact combination mechanism, any continued pretraining or alignment phases, hyper-parameters, and initialization choices. This will make the recipe fully specified while preserving the core claim that the SigLIP+Gemma pairing enables broad transfer. revision: yes
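One concrete detail such an expansion would pin down is the attention pattern over the fused sequence. Models in this family are usually described as using a prefix-LM setup: full bidirectional attention over the image tokens and the text prefix, with causal attention only over the generated suffix. The sketch below builds such a mask under assumed lengths; it illustrates the masking idea, not the authors' released code.

```python
import numpy as np

def prefix_lm_mask(num_prefix: int, num_suffix: int) -> np.ndarray:
    """Boolean attention mask (True = may attend). Image and prompt tokens
    (the prefix) attend to each other bidirectionally; suffix (target)
    tokens attend to the whole prefix and causally to earlier suffix tokens."""
    n = num_prefix + num_suffix
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :num_prefix] = True                            # everyone sees the full prefix
    mask[num_prefix:, num_prefix:] = np.tril(
        np.ones((num_suffix, num_suffix), dtype=bool))     # causal within the suffix
    return mask

# e.g. 4 image+prompt tokens followed by 3 target tokens
print(prefix_lm_mask(4, 3).astype(int))
```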
Circularity Check
No circularity: empirical model release with no derivation chain
Full rationale
The paper presents PaliGemma as an open VLM combining the SigLIP-So400m vision encoder and Gemma-2B language model, trained for versatile transfer and evaluated on nearly 40 tasks. No equations, first-principles derivations, predictions, or fitted parameters are introduced that could reduce to inputs by construction. Claims rest entirely on external benchmark results rather than internal self-definitions or self-citation chains. This is a standard empirical model release whose central assertions are externally falsifiable and independent of any circular reduction.
Lean theorems connected to this paper
-
HierarchyEmergence.hierarchy_emergence_forces_phi (tagged unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: PaliGemma consists of three components: An image encoder... A decoder-only language model... A linear layer projecting SigLIP’s output tokens into the same dimensions as Gemma-2B’s vocab tokens
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 54 Pith papers
-
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
-
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
-
Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models
MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
-
CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models
CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.
-
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.
-
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
-
Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, yielding 8.7x lower latency with 54-72% performance retention on ScanNet and ...
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.
-
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.
-
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models
Interventional attribution via ISS and NMR diagnoses causal misalignment in VLA policies and predicts their generalization performance across manipulation tasks.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
-
MotuBrain: An Advanced World Action Model for Robot Control
MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
-
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
-
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
-
AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models
AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...
-
ViLL-E: Video LLM Embeddings for Retrieval
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
-
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.
-
AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.
-
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a photorealistic simulation benchmark with 120 tasks and perturbation analysis to evaluate true generalization and robustness of robotic foundation models.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV datase...
-
Perception Encoder: The best visual embeddings are not at the output of the network
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
-
FAST: Efficient Action Tokenization for Vision-Language-Action Models
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
Exploring and Exploiting Stability in Latent Flow Matching
Latent Flow Matching models exhibit inherent stability to data reduction and model shrinkage due to the flow matching objective, enabling reduced-dataset training and two-stage inference with over 2x speedup while pre...
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
-
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
-
The Amazing Stability of Flow Matching
Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.
-
AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning
AEGIS uses a pre-computed Gaussian anchor and layer-wise Gram-Schmidt orthogonal projections to isolate destructive gradients during VLA fine-tuning, preserving VQA performance without co-training or replay.
-
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
-
RLDX-1 Technical Report
RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
-
PLaMo 2.1-VL Technical Report
PLaMo 2.1-VL reports 61.5 ROUGE-L on JA-VG-VQA-500, 85.2% on Japanese Ref-L4, 53.9% zero-shot factory accuracy, and raises anomaly detection F1 from 39.7 to 64.9 after fine-tuning.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
-
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...
-
PaliGemma 2: A Family of Versatile VLMs for Transfer
PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...