arxiv: 2502.14786 · v1 · submitted 2025-02-20 · 💻 cs.CV · cs.AI


SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai


Pith reviewed 2026-05-10 15:44 UTC · model grok-4.3


keywords: vision-language encoders · SigLIP · multilingual vision-language models · zero-shot classification · image-text retrieval · localization · dense prediction · self-supervised learning


The pith

SigLIP 2 encoders outperform the original SigLIP at every scale on core vision-language tasks and show large gains on localization and dense prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SigLIP 2, a family of multilingual vision-language encoders that extend the original SigLIP image-text objective with captioning-based pretraining, self-supervised losses such as self-distillation and masked prediction, and online data curation. These additions are combined into one training recipe, and the resulting models beat their SigLIP counterparts across model sizes in zero-shot classification, image-text retrieval, and transfer to vision-language models. The same recipe produces clear improvements on localization and dense feature tasks, supports multiple resolutions while preserving native aspect ratios, and uses a more diverse de-biased data mixture to strengthen multilingual performance and fairness. Checkpoints are released at four sizes from 86 million to 1 billion parameters so users can balance speed and accuracy.

Core claim

SigLIP 2 models trained with the extended recipe that unifies captioning pretraining, self-supervised objectives, and online curation outperform prior SigLIP versions at all scales on zero-shot classification, image-text retrieval, and visual representation transfer for VLMs, while also delivering significant gains on localization and dense prediction tasks; multi-resolution variants preserve native aspect ratios and a de-biased diverse data mixture improves multilingual understanding and fairness.

What carries the argument

The unified training recipe that adds captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation to the base SigLIP image-text objective, plus multi-resolution support and de-biasing on a diverse data mixture.
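
As a point of reference for the base objective, the original SigLIP loss treats every image-text pair in a batch as an independent binary classification: matching pairs are positives, all other pairs are negatives. The following is a minimal sketch of that pairwise sigmoid loss in plain Python, assuming the embeddings are already L2-normalized. The function names and the temperature and bias values are illustrative choices, not settings taken from the SigLIP 2 paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pair_loss(img_embs, txt_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image and text embeddings.

    Pair (i, j) gets label +1 when i == j (matching pair) and -1 otherwise.
    temperature and bias are illustrative values (assumption).
    """
    n = len(img_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(img_embs[i], txt_embs[j]) + bias
            total += -log_sigmoid(label * logit)
    return total / n

# Toy batch: aligned pairs should score a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
print(sigmoid_pair_loss(images, texts_aligned) < sigmoid_pair_loss(images, texts_swapped))
```

Because each pair is scored independently, the loss needs no softmax normalization across the batch. The additional objectives named above sit on top of this term.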

If this is right

Outperforms original SigLIP at every model scale on zero-shot classification and image-text retrieval.
Better visual representations for downstream vision-language models.
Substantial gains on localization and dense prediction benchmarks.
Multi-resolution models that keep native aspect ratios improve flexibility.
De-biased diverse training yields stronger multilingual results and fairness.
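
To make the native-aspect-ratio point concrete, here is a small sketch of one way to choose a patch grid that roughly preserves an image's height-to-width ratio under a fixed token budget. It is an illustration under stated assumptions, not the paper's resizing procedure: `patch_grid`, `patch_size`, and `max_patches` are hypothetical names and values.

```python
import math

def patch_grid(height, width, patch_size=16, max_patches=256):
    """Return (rows, cols) of patches with rows * cols <= max_patches.

    The grid keeps the input's aspect ratio as closely as integer
    rounding allows. Hypothetical helper, not the paper's algorithm.
    """
    # Scale factor that shrinks (or grows) the image area to the patch budget.
    scale = math.sqrt(max_patches * patch_size ** 2 / (height * width))
    rows = min(max_patches, max(1, math.floor(height * scale / patch_size)))
    # Clamp columns so the total never exceeds the budget, even for
    # extreme aspect ratios where rows was forced up to 1.
    cols = min(max_patches // rows, max(1, math.floor(width * scale / patch_size)))
    return rows, cols

rows, cols = patch_grid(480, 640)
print(rows, cols, rows * cols)
```

A square-resize pipeline would map the 480x640 input to the same grid as a square image; the sketch instead yields more columns than rows, so wide inputs are not squashed.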

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The localization and dense-feature improvements could make these encoders more useful for tasks like object detection or segmentation inside larger systems.
Releasing multiple sizes from 86M to 1B parameters lets practitioners match model capacity to available compute while keeping the same training benefits.
The de-biasing step may reduce cultural or linguistic skew in applications that serve global users, though its effect on other biases remains untested here.
Because the gains come from a modular recipe, similar combinations could be tested on other vision-language bases to check whether they transfer.

Load-bearing premise

That the added captioning pretraining, self-supervised losses, and online curation combine without negative interactions or overfitting to the chosen data mixture, and that de-biasing improves fairness without hurting main performance.

What would settle it

Retraining the exact original SigLIP architecture and data with only the new combined recipe and checking whether zero-shot accuracy, retrieval scores, and localization metrics rise by the claimed margins without trade-offs.

Original abstract

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
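
Zero-shot classification with an encoder of this kind is usually done by embedding the image and one text prompt per class, then picking the class whose text embedding is most similar to the image embedding. The sketch below shows that procedure with plain Python lists standing in for real encoder outputs. The embeddings and class names are made-up placeholders, and `zero_shot_classify` is a hypothetical helper rather than an API of any released checkpoint.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_emb, class_text_embs):
    """Return the best class name and all cosine similarity scores.

    class_text_embs maps a class name to the embedding of its text prompt.
    """
    img = normalize(image_emb)
    scores = {
        name: sum(a * b for a, b in zip(img, normalize(emb)))
        for name, emb in class_text_embs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Placeholder embeddings (assumption): a real pipeline would obtain these
# from the image and text towers of the encoder.
label, scores = zero_shot_classify(
    [0.9, 0.1],
    {"cat": [1.0, 0.0], "dog": [0.0, 1.0]},
)
print(label)
```

No classifier head is trained here; the class set can be changed at inference time simply by supplying different text prompts.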

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SigLIP 2 folds captioning pretraining, self-distillation, masked prediction, and online curation into the original contrastive recipe and reports gains on zero-shot, retrieval, localization, and multilingual tasks across scales.


The core update is straightforward: they take the SigLIP contrastive baseline and add a set of previously published pieces (captioning pretraining, self-supervised losses, and online data curation), then train on a broader, de-biased mixture. The result is consistent lifts over the prior SigLIP models at every size from ViT-B to 1B, with the biggest reported improvements on localization and dense prediction. They also ship multi-resolution variants that keep native aspect ratios and release the checkpoints, which is immediately useful for anyone swapping encoders into VLMs or retrieval systems. The multilingual and fairness angle from the data mix is a practical addition rather than a side claim.

What the paper does cleanly is show that these pieces can be combined without obvious breakage and that the gains appear across standard benchmarks and transfer settings. The multi-scale release and the focus on dense features give it more immediate engineering value than a pure scaling paper.

The main soft spot is attribution. The abstract and results tie the improvements to the unified recipe, but the write-up does not yet isolate how much each added loss or curation step contributes versus simply using more or better data. Controls for total compute and data volume would make the causal story tighter, and the fairness claims would benefit from explicit before-and-after metrics on the core tasks. No internal contradictions jump out, and the work stays grounded in the prior SigLIP results rather than overclaiming novelty.

This is for groups that train or fine-tune vision-language models and want a stronger off-the-shelf encoder, especially if they care about localization or non-English performance. It is incremental engineering rather than a new paradigm, but the empirical pattern is clear enough to be worth checking. I would send it to peer review; the claims are testable and the released models let others verify quickly.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SigLIP 2, a family of multilingual vision-language encoders extending the original SigLIP image-text objective with captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation. The central claim is that this unified recipe yields consistent outperformance over SigLIP baselines at all scales (ViT-B to 1B) on zero-shot classification, image-text retrieval, and VLM transfer tasks, plus substantial gains on localization and dense prediction. Additional variants support multiple resolutions while preserving native aspect ratios, and a more diverse de-biased data mixture improves multilingual understanding and fairness. Checkpoints are released at four sizes.

Significance. If the empirical results hold with proper controls, the work would provide a stronger, practical baseline for vision-language pretraining by showing additive benefits from combining established techniques. Improvements in localization/dense features and multilingual fairness address real limitations in current encoders, and the multi-scale releases enable cost-performance trade-offs. The approach of unifying prior methods into a single recipe could influence subsequent training pipelines, though its value depends on whether gains are attributable to the recipe rather than uncontrolled factors such as total compute or data volume.

major comments (1)

The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.

minor comments (2)

Notation for the extended loss (captioning + self-supervised terms) should be defined explicitly, including weighting coefficients, to allow reproduction.
Clarify how online data curation interacts with the de-biasing mixture; any overlap or filtering steps should be described to avoid ambiguity in the data pipeline.
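
The request for explicit weighting coefficients can be illustrated with a short sketch of how a multi-term objective is typically assembled: a weighted sum of the individual loss values. The term names and weights below are hypothetical placeholders chosen for illustration; the abstract does not state the actual coefficients.

```python
def combined_loss(losses, weights):
    """Weighted sum of named loss terms.

    losses and weights are dicts keyed by term name. Every loss term must
    have a weight, so a forgotten coefficient fails loudly.
    """
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical per-step loss values and coefficients (assumptions).
step_losses = {"sigmoid": 0.7, "caption": 2.0}
step_weights = {"sigmoid": 1.0, "caption": 0.5}
total = combined_loss(step_losses, step_weights)
print(total)
```

Reporting the contents of a dictionary like `step_weights`, along with any schedule that changes it during training, is the kind of detail that would let others reproduce the recipe.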

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. We address the single major comment below and have prepared revisions to strengthen the presentation of our results.


Referee: The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.

Authors: We agree that the abstract, due to its length constraints, does not contain specific quantitative results, ablation details, or explicit statements on experimental controls. The full manuscript addresses these points through quantitative comparisons across multiple tables and figures, ablation studies in Section 4 that isolate the contribution of each added component (captioning, self-supervised losses, and data curation), and Section 3 which describes the training protocol with matched data volumes, step counts, and resolutions relative to the SigLIP baselines. To make this immediately visible, we will revise the abstract to include a small number of key performance deltas and a brief reference to the controlled experimental setup. These changes ensure the central claim can be evaluated without requiring the reader to consult the full text first.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical recipe evaluated on external benchmarks


The paper describes an empirical training recipe that extends the prior SigLIP objective with captioning pretraining, self-supervised losses, and online curation, then reports performance gains on standard zero-shot, retrieval, VLM transfer, localization, and dense-prediction benchmarks. No equations, uniqueness theorems, or first-principles derivations are present that could reduce a claimed result to a fitted parameter or self-referential definition. Self-citations to the original SigLIP work serve only as the baseline for comparison and do not carry the load of proving the new gains; those gains are measured against held-out test sets. The argument is therefore self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

As an empirical scaling paper, the central claim rests on standard assumptions of deep learning optimization and data representativeness rather than new mathematical derivations.

free parameters (2)

loss weighting coefficients
Weights balancing the original contrastive loss with added captioning and self-supervised terms are chosen during training.
data mixture proportions
Proportions in the diverse multilingual data mixture including de-biasing are selected to achieve reported fairness gains.
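
To show why mixture proportions count as a free parameter, the sketch below draws training examples from named data sources according to chosen proportions. The source names and the 0.9/0.1 split are hypothetical values for illustration only and are not taken from the paper.

```python
import random

def sample_sources(proportions, n, seed=0):
    """Draw n source names with probability proportional to proportions.

    proportions maps a source name to its (unnormalized) mixture weight.
    A seeded generator keeps the draw reproducible.
    """
    rng = random.Random(seed)
    names = list(proportions)
    weights = [proportions[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

# Hypothetical mixture (assumption): mostly one source, some of another.
draws = sample_sources({"english": 0.9, "multilingual": 0.1}, 10000)
print(draws.count("english"), draws.count("multilingual"))
```

Changing the weights changes what the model sees during training, which is why the chosen proportions need to be reported alongside the results they produce.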

axioms (2)

domain assumption: ViT-based encoder architecture behaves consistently under the added objectives
The paper assumes the base SigLIP architecture scales without modification when new losses are introduced.
domain assumption: Online data curation selects representative samples without introducing selection bias
Assumes the curation process improves quality without distorting the underlying data distribution.
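
The selection-bias concern is easier to see with a toy version of online curation: score each candidate in a large batch and train only on the top-scoring fraction. The sketch below is a simple stand-in, assuming a hypothetical per-example score; it is not the curation method used in the paper, and `curate_batch`, `score_fn`, and `keep_fraction` are illustrative names.

```python
def curate_batch(candidates, score_fn, keep_fraction=0.5):
    """Keep the highest-scoring fraction of a candidate batch.

    score_fn maps an example to a number where higher means more useful.
    Stand-in for an online curation step (assumption).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = max(1, int(len(candidates) * keep_fraction))
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:keep]

# Toy examples where the example itself serves as its score.
kept = curate_batch([1, 5, 3, 2], score_fn=lambda x: x, keep_fraction=0.5)
print(kept)
```

Any such filter shifts the training distribution toward whatever the score rewards, which is exactly the distortion this assumption says does not harm the result.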

pith-pipeline@v0.9.0 · 5571 in / 1351 out tokens · 69744 ms · 2026-05-10T15:44:04.883873+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
cs.CR 2026-05 conditional novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...
Representation Fréchet Loss for Visual Generation
cs.CV 2026-04 unverdicted novelty 8.0

Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-represe...
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 7.0

Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models
cs.CV 2026-05 conditional novelty 7.0

LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
cs.CV 2026-05 unverdicted novelty 7.0

VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation
cs.RO 2026-05 conditional novelty 7.0

A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
cs.AI 2026-05 unverdicted novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 7.0

Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
cs.CV 2026-05 unverdicted novelty 7.0

BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.
Attention Transfer Is Not Universally Effective for Vision Transformers
cs.CV 2026-05 accept novelty 7.0

Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.
Attributions All the Way Down? The Metagame of Interpretability
cs.LG 2026-05 unverdicted novelty 7.0

Defines meta-attributions as directional second-order Shapley values on attribution methods, proves hierarchical decomposition of attributions, and demonstrates applications in language models, vision-language encoder...
OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention
cs.CV 2026-05 unverdicted novelty 7.0

OpenGaFF combines a geometry-conditioned Gaussian Feature Field with codebook-guided attention to deliver more spatially coherent open-vocabulary 3D semantic segmentation than prior methods.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
Posterior Augmented Flow Matching
cs.CV 2026-05 unverdicted novelty 7.0

PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.
Differentially Private Contrastive Learning via Bounding Group-level Contribution
cs.CR 2026-04 unverdicted novelty 7.0

DP-GCL improves differentially private contrastive learning by bounding group-level contributions through batch partitioning and intra-group augmentation, delivering 5.6% higher image classification accuracy and 20.1%...
GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution
cs.CV 2026-04 unverdicted novelty 7.0

GramSR uses DINOv3 visual features instead of text captions to condition a one-step diffusion model for super-resolution via sequential pixel, semantic, and texture LoRA modules.
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
cs.GR 2026-04 unverdicted novelty 7.0

StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking
cs.CV 2026-04 unverdicted novelty 7.0

RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.
Evaluating Remote Sensing Image Captions Beyond Metric Biases
cs.CV 2026-04 unverdicted novelty 7.0

Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...
Hybrid Latent Reasoning with Decoupled Policy Optimization
cs.CV 2026-04 unverdicted novelty 7.0

HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
Coevolving Representations in Joint Image-Feature Diffusion
cs.CV 2026-04 unverdicted novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
cs.CV 2026-04 unverdicted novelty 7.0

Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
cs.CV 2026-04 unverdicted novelty 7.0

Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
UNIGEOCLIP: Unified Geospatial Contrastive Learning
cs.CV 2026-04 unverdicted novelty 7.0

UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Bottleneck Tokens for Unified Multimodal Retrieval
cs.LG 2026-04 unverdicted novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
RewardFlow: Generate Images by Optimizing What You Reward
cs.CV 2026-04 unverdicted novelty 7.0

RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
cs.CV 2026-04 unverdicted novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
Show Me the Infographic I Imagine: Intent-Aware Infographic Retrieval for Authoring Support
cs.IR 2026-04 unverdicted novelty 7.0

Presents a new retrieval system that enriches user queries with an intent taxonomy to improve matching of natural language descriptions to infographic designs and support authoring.
Personalizing Text-to-Image Generation to Individual Taste
cs.CV 2026-04 unverdicted novelty 7.0

PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
cs.CV 2026-04 conditional novelty 7.0

Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
cs.CV 2026-05 unverdicted novelty 6.0

StyleTextGen proposes a dual-branch style encoder, text style consistency loss, and mask-guided inference to achieve superior style consistency and cross-lingual performance in multilingual scene text generation on a ...
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 6.0

MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
Elastic Attention Cores for Scalable Vision Transformers
cs.CV 2026-05 unverdicted novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
Unlocking UML Class Diagram Understanding in Vision Language Models
cs.CV 2026-05 unverdicted novelty 6.0

A new UML class diagram VQA benchmark and 16k dataset enable LoRA fine-tuning to outperform Qwen 3.5 27B.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 conditional novelty 6.0

Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 unverdicted novelty 6.0

Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
cs.CV 2026-05 unverdicted novelty 6.0

A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
cs.SE 2026-05 unverdicted novelty 6.0

VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...
How Mobile World Model Guides GUI Agents?
cs.AI 2026-05 unverdicted novelty 6.0

Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
cs.CV 2026-05 unverdicted novelty 6.0

LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 6.0

GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
cs.CV 2026-05 unverdicted novelty 6.0

BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.
MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation
cs.CV 2026-05 unverdicted novelty 6.0

MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.
ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
cs.CV 2026-05 unverdicted novelty 6.0

ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier i...
Taming Outlier Tokens in Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 6.0

Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
cs.SD 2026-05 accept novelty 6.0

MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.
Text-Conditional JEPA for Learning Semantically Rich Visual Representations
cs.LG 2026-05 unverdicted novelty 6.0

TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 6.0

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video
cs.CV 2026-05 unverdicted novelty 6.0

A two-pass pipeline with Qwen3-VL-Plus and Gemini 3.1 Flash-Lite achieves 0.539 accuracy on the ACCIDENT@CVPR 2026 benchmark of 2,027 real CCTV videos for zero-shot temporal-spatial grounding of traffic events.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
cs.CV 2026-05 unverdicted novelty 6.0

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
cs.CV 2026-05 unverdicted novelty 6.0

An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
cs.CV 2026-04 unverdicted novelty 6.0

LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentatio...
Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
cs.CV 2026-04 conditional novelty 6.0

CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted...

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 122 Pith papers · 8 internal anchors
