Recognition: no theorem link
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Pith reviewed 2026-05-10 15:44 UTC · model grok-4.3
The pith
SigLIP 2 encoders outperform the original SigLIP at every scale on core vision-language tasks and show large gains on localization and dense prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SigLIP 2 models trained with the extended recipe that unifies captioning pretraining, self-supervised objectives, and online curation outperform prior SigLIP versions at all scales on zero-shot classification, image-text retrieval, and visual representation transfer for VLMs, while also delivering significant gains on localization and dense prediction tasks; multi-resolution variants preserve native aspect ratios and a de-biased diverse data mixture improves multilingual understanding and fairness.
What carries the argument
The unified training recipe that adds captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation to the base SigLIP image-text objective, plus multi-resolution support and de-biasing on a diverse data mixture.
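For orientation, here is a minimal NumPy sketch of the sigmoid pairwise image-text loss that this recipe extends, following the published SigLIP formulation; the temperature and bias initializations are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def siglip_pairwise_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sigmoid pairwise image-text loss in the style of SigLIP.

    img_emb, txt_emb: (n, d) L2-normalized embeddings of n paired examples.
    t, b: temperature and bias (learnable in practice; the values here are illustrative).
    """
    logits = t * img_emb @ txt_emb.T + b               # (n, n) pairwise similarities
    labels = 2.0 * np.eye(len(img_emb)) - 1.0          # +1 for matched pairs, -1 otherwise
    # -log sigmoid(label * logit), summed over all pairs, averaged over the batch
    return np.mean(np.sum(np.logaddexp(0.0, -labels * logits), axis=1))

# toy usage
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 8)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(siglip_pairwise_loss(img, txt))
```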
If this is right
- Outperforms original SigLIP at every model scale on zero-shot classification and image-text retrieval.
- Better visual representations for downstream vision-language models.
- Substantial gains on localization and dense prediction benchmarks.
- Multi-resolution models that keep native aspect ratios improve flexibility.
- De-biased diverse training yields stronger multilingual results and fairness.
Where Pith is reading between the lines
- The localization and dense-feature improvements could make these encoders more useful for tasks like object detection or segmentation inside larger systems.
- Releasing multiple sizes from 86M to 1B parameters lets practitioners match model capacity to available compute while keeping the same training benefits.
- The de-biasing step may reduce cultural or linguistic skew in applications that serve global users, though its effect on other biases remains untested here.
- Because the gains come from a modular recipe, similar combinations could be tested on other vision-language bases to check whether they transfer.
Load-bearing premise
That the added captioning pretraining, self-supervised losses, and online curation combine without negative interactions or overfitting to the chosen data mixture, and that de-biasing improves fairness without hurting main performance.
What would settle it
Retraining the original SigLIP architecture on the same data, changing only the training recipe to the new combined one, and checking whether zero-shot accuracy, retrieval scores, and localization metrics rise by the claimed margins without trade-offs elsewhere.
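Read as a protocol, that test amounts to an ablation grid over the recipe's components with everything else held fixed. The sketch below is a hypothetical enumeration of such runs, not the paper's actual experiment code; the component names are placeholders.

```python
# Hypothetical ablation grid: every run keeps the original SigLIP architecture,
# data volume, step count, and resolution fixed; only the listed training
# components are toggled.
BASE = {"sigmoid_image_text_loss": True}
COMPONENTS = ["captioning_pretraining", "self_distillation",
              "masked_prediction", "online_curation"]

def ablation_configs():
    """Yield (name, config): the SigLIP baseline, each component added alone, and the full recipe."""
    yield "siglip_baseline", dict(BASE)
    for comp in COMPONENTS:
        yield f"+{comp}", {**BASE, comp: True}
    yield "full_recipe", {**BASE, **{c: True for c in COMPONENTS}}

for name, cfg in ablation_configs():
    print(name, cfg)
```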
Original abstract
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
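The abstract's multi-resolution variants keep the input's native aspect ratio. As a rough illustration of that idea (not the paper's implementation), the sketch below fits an image to a fixed patch-token budget while approximately preserving its aspect ratio; the patch size and token budget are arbitrary assumptions.

```python
import math

def aspect_preserving_grid(height, width, patch=16, max_tokens=256):
    """Pick a patch grid that respects the image's native aspect ratio
    while staying within a fixed token budget (illustrative only)."""
    aspect = width / height
    # Largest grid (rows x cols) with rows * cols <= max_tokens and cols / rows ~= aspect
    rows = max(1, int(math.floor(math.sqrt(max_tokens / aspect))))
    cols = max(1, min(int(round(rows * aspect)), max_tokens // rows))
    return rows, cols, (rows * patch, cols * patch)   # grid and resize target in pixels

print(aspect_preserving_grid(480, 640))   # landscape image
print(aspect_preserving_grid(1024, 512))  # tall image
```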
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SigLIP 2, a family of multilingual vision-language encoders extending the original SigLIP image-text objective with captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation. The central claim is that this unified recipe yields consistent outperformance over SigLIP baselines at all scales (ViT-B to 1B) on zero-shot classification, image-text retrieval, and VLM transfer tasks, plus substantial gains on localization and dense prediction. Additional variants support multiple resolutions while preserving native aspect ratios, and a more diverse de-biased data mixture improves multilingual understanding and fairness. Checkpoints are released at four sizes.
Significance. If the empirical results hold with proper controls, the work would provide a stronger, practical baseline for vision-language pretraining by showing additive benefits from combining established techniques. Improvements in localization/dense features and multilingual fairness address real limitations in current encoders, and the multi-scale releases enable cost-performance trade-offs. The approach of unifying prior methods into a single recipe could influence subsequent training pipelines, though its value depends on whether gains are attributable to the recipe rather than uncontrolled factors such as total compute or data volume.
major comments (1)
- The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.
minor comments (2)
- Notation for the extended loss (captioning + self-supervised terms) should be defined explicitly, including weighting coefficients, to allow reproduction; an illustrative form is sketched after this list.
- Clarify how online data curation interacts with the de-biasing mixture; any overlap or filtering steps should be described to avoid ambiguity in the data pipeline.
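For concreteness, one illustrative way to write the extended objective, with placeholder weights (the λ's are assumptions, not coefficients reported in the paper):

```latex
\mathcal{L}
  \;=\; \mathcal{L}_{\mathrm{sigmoid}}
  \;+\; \lambda_{\mathrm{cap}}\,\mathcal{L}_{\mathrm{cap}}
  \;+\; \lambda_{\mathrm{dist}}\,\mathcal{L}_{\mathrm{distill}}
  \;+\; \lambda_{\mathrm{mask}}\,\mathcal{L}_{\mathrm{mask}}
```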
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address the single major comment below and have prepared revisions to strengthen the presentation of our results.
Point-by-point responses
-
Referee: The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.
Authors: We agree that the abstract, due to its length constraints, does not contain specific quantitative results, ablation details, or explicit statements on experimental controls. The full manuscript addresses these points through quantitative comparisons across multiple tables and figures, ablation studies in Section 4 that isolate the contribution of each added component (captioning, self-supervised losses, and data curation), and Section 3, which describes the training protocol with matched data volumes, step counts, and resolutions relative to the SigLIP baselines. To make this immediately visible, we will revise the abstract to include a small number of key performance deltas and a brief reference to the controlled experimental setup. These changes ensure the central claim can be evaluated without requiring the reader to consult the full text first.
Revision: yes
Circularity Check
No significant circularity; empirical recipe evaluated on external benchmarks
Full rationale
The paper describes an empirical training recipe that extends the prior SigLIP objective with captioning pretraining, self-supervised losses, and online curation, then reports performance gains on standard zero-shot, retrieval, VLM transfer, localization, and dense-prediction benchmarks. No equations, uniqueness theorems, or first-principles derivations are present that could reduce a claimed result to a fitted parameter or self-referential definition. Self-citations to the original SigLIP work serve only as the baseline for comparison and do not carry the load of proving the new gains; those gains are measured against held-out test sets. The argument is therefore self-contained against external benchmarks and contains no circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- loss weighting coefficients
- data mixture proportions
axioms (2)
- domain assumption: the ViT-based encoder architecture behaves consistently under the added objectives
- domain assumption: online data curation selects representative samples without introducing selection bias
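To make the ledger's free parameters concrete, a hypothetical configuration sketch follows; every key name and value is a placeholder, not a setting reported by the paper.

```python
# Placeholder values only: the ledger names these as free parameters,
# but the paper's actual settings are not reproduced here.
RECIPE_CONFIG = {
    "loss_weights": {            # weighting coefficients for the added objectives
        "sigmoid_image_text": 1.0,
        "captioning": 1.0,
        "self_distillation": 0.5,
        "masked_prediction": 0.5,
    },
    "data_mixture": {            # proportions of the training mixture (sum to 1)
        "english_web_pairs": 0.6,
        "multilingual_web_pairs": 0.3,
        "debiased_curated_subset": 0.1,
    },
}
assert abs(sum(RECIPE_CONFIG["data_mixture"].values()) - 1.0) < 1e-9
```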
Forward citations
Cited by 60 Pith papers
-
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...
-
Representation Fréchet Loss for Visual Generation
Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-represe...
-
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
-
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models
LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.
-
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
-
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.
-
Attention Transfer Is Not Universally Effective for Vision Transformers
Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.
-
Attributions All the Way Down? The Metagame of Interpretability
Defines meta-attributions as directional second-order Shapley values on attribution methods, proves hierarchical decomposition of attributions, and demonstrates applications in language models, vision-language encoder...
-
OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention
OpenGaFF combines a geometry-conditioned Gaussian Feature Field with codebook-guided attention to deliver more spatially coherent open-vocabulary 3D semantic segmentation than prior methods.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
Posterior Augmented Flow Matching
PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.
-
Differentially Private Contrastive Learning via Bounding Group-level Contribution
DP-GCL improves differentially private contrastive learning by bounding group-level contributions through batch partitioning and intra-group augmentation, delivering 5.6% higher image classification accuracy and 20.1%...
-
GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution
GramSR uses DINOv3 visual features instead of text captions to condition a one-step diffusion model for super-resolution via sequential pixel, semantic, and texture LoRA modules.
-
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
-
RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking
RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.
-
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...
-
Hybrid Latent Reasoning with Decoupled Policy Optimization
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
-
Coevolving Representations in Joint Image-Feature Diffusion
CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...
-
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
-
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
-
UNIGEOCLIP: Unified Geospatial Contrastive Learning
UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
Bottleneck Tokens for Unified Multimodal Retrieval
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
-
RewardFlow: Generate Images by Optimizing What You Reward
RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
-
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
-
Show Me the Infographic I Imagine: Intent-Aware Infographic Retrieval for Authoring Support
Presents a new retrieval system that enriches user queries with an intent taxonomy to improve matching of natural language descriptions to infographic designs and support authoring.
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
-
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
StyleTextGen proposes a dual-branch style encoder, text style consistency loss, and mask-guided inference to achieve superior style consistency and cross-lingual performance in multilingual scene text generation on a ...
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
Unlocking UML Class Diagram Understanding in Vision Language Models
A new UML class diagram VQA benchmark and 16k dataset enable LoRA fine-tuning to outperform Qwen 3.5 27B.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
-
Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...
-
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
-
VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...
-
How Mobile World Model Guides GUI Agents?
Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
-
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
-
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.
-
MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation
MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.
-
ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier i...
-
Taming Outlier Tokens in Diffusion Transformers
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
-
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.
-
Text-Conditional JEPA for Learning Semantically Rich Visual Representations
TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
-
Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video
A two-pass pipeline with Qwen3-VL-Plus and Gemini 3.1 Flash-Lite achieves 0.539 accuracy on the ACCIDENT@CVPR 2026 benchmark of 2,027 real CCTV videos for zero-shot temporal-spatial grounding of traffic events.
-
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
-
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
-
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentatio...
-
Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted...