pith. machine review for the scientific record.

arxiv: 2502.14786 · v1 · submitted 2025-02-20 · 💻 cs.CV · cs.AI

Recognition: no theorem link

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features


Pith reviewed 2026-05-10 15:44 UTC · model grok-4.3

keywords: vision-language encoders · SigLIP · multilingual vision-language models · zero-shot classification · image-text retrieval · localization · dense prediction · self-supervised learning

The pith

SigLIP 2 encoders outperform the original SigLIP at every scale on core vision-language tasks and show large gains on localization and dense prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SigLIP 2, a family of multilingual vision-language encoders that extend the original SigLIP image-text objective with captioning-based pretraining, self-supervised losses such as self-distillation and masked prediction, and online data curation. These additions are combined into one training recipe, and the resulting models beat their SigLIP counterparts across model sizes in zero-shot classification, image-text retrieval, and transfer to vision-language models. The same recipe produces clear improvements on localization and dense feature tasks, supports multiple resolutions while preserving native aspect ratios, and uses a more diverse de-biased data mixture to strengthen multilingual performance and fairness. Checkpoints are released at four sizes from 86 million to 1 billion parameters so users can balance speed and accuracy.
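The multi-resolution, native-aspect-ratio idea mentioned above can be illustrated with a small sketch. This is not the paper's procedure, which this page does not describe; it is one plausible way to pick a patch grid that roughly keeps the input's height-to-width ratio under a fixed token budget. The function name `patch_grid` and the parameters `patch_size` and `max_patches` are hypothetical.

```python
import math


def patch_grid(height, width, patch_size=16, max_patches=256):
    """Pick a (rows, cols) patch grid that roughly keeps height/width
    while using at most max_patches patches.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # Scale factor at which the resized image would hold exactly
    # max_patches patches if fractional patches were allowed.
    scale = math.sqrt(max_patches * patch_size**2 / (height * width))
    # Round down so the budget is never exceeded; keep at least one patch.
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, math.floor(width * scale / patch_size))
    # Guard against extreme aspect ratios where clamping to 1 breaks the budget.
    rows = min(rows, max_patches)
    cols = min(cols, max_patches // rows)
    return rows, cols


if __name__ == "__main__":
    print(patch_grid(512, 512))   # square input
    print(patch_grid(480, 960))   # wide input keeps a wide grid
```

Rounding down keeps the sequence length within budget at the cost of a slight aspect distortion; a fixed square resize would instead distort every non-square image.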

Core claim

SigLIP 2 models trained with the extended recipe that unifies captioning pretraining, self-supervised objectives, and online curation outperform prior SigLIP versions at all scales on zero-shot classification, image-text retrieval, and visual representation transfer for VLMs, while also delivering significant gains on localization and dense prediction tasks; multi-resolution variants preserve native aspect ratios and a de-biased diverse data mixture improves multilingual understanding and fairness.

What carries the argument

The unified training recipe that adds captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation to the base SigLIP image-text objective, plus multi-resolution support and de-biasing on a diverse data mixture.
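For orientation, the base image-text objective that the recipe extends can be sketched as follows. This is a minimal pure-Python illustration, assuming the pairwise sigmoid form from the original SigLIP paper, where every image-text pair in a batch is scored independently as a match or a non-match. The default `temperature` and `bias` values are illustrative assumptions, and the added captioning and self-supervised terms are not shown.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def sigmoid_image_text_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of paired embeddings.

    Pair (i, i) is a positive; every pair (i, j) with i != j is a negative.
    temperature and bias are illustrative values, learnable in practice.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(img, txt) + bias
            total -= log_sigmoid(label * logit)
    # Average over the batch size, summing over all pairs.
    return total / len(images)


if __name__ == "__main__":
    imgs = [[1.0, 0.0], [0.0, 1.0]]
    aligned = sigmoid_image_text_loss(imgs, [[1.0, 0.0], [0.0, 1.0]])
    shuffled = sigmoid_image_text_loss(imgs, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

Because each pair contributes its own binary term, the loss needs no batch-wide softmax normalization, which is the property that distinguishes it from a softmax contrastive loss.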

If this is right

  • Outperforms original SigLIP at every model scale on zero-shot classification and image-text retrieval.
  • Better visual representations for downstream vision-language models.
  • Substantial gains on localization and dense prediction benchmarks.
  • Multi-resolution models that keep native aspect ratios improve flexibility.
  • De-biased diverse training yields stronger multilingual results and fairness.
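As a reminder of what the zero-shot classification benchmark in the first bullet measures, here is a minimal sketch: an image embedding is compared against one text embedding per class, and the most similar class wins. The embeddings below are toy vectors, and `zero_shot_classify` is a hypothetical helper; a real evaluation would obtain both embeddings from the encoder under test.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def zero_shot_classify(image_emb, class_text_embs):
    """Return (best_class, scores) given a dict of class name -> text embedding.

    Hypothetical helper for illustration; no model is involved here.
    """
    scores = {name: cosine(image_emb, emb) for name, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs.
    classes = {"cat": [1.0, 0.1], "dog": [0.1, 1.0]}
    best, scores = zero_shot_classify([0.9, 0.2], classes)
    print(best, scores)
```

Image-text retrieval uses the same similarity scores but ranks a pool of candidate texts or images instead of taking a single argmax.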

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The localization and dense-feature improvements could make these encoders more useful for tasks like object detection or segmentation inside larger systems.
  • Releasing multiple sizes from 86M to 1B parameters lets practitioners match model capacity to available compute while keeping the same training benefits.
  • The de-biasing step may reduce cultural or linguistic skew in applications that serve global users, though its effect on other biases remains untested here.
  • Because the gains come from a modular recipe, similar combinations could be tested on other vision-language bases to check whether they transfer.

Load-bearing premise

That the added captioning pretraining, self-supervised losses, and online curation combine without negative interactions or overfitting to the chosen data mixture, and that de-biasing improves fairness without hurting main performance.

What would settle it

Retraining the exact original SigLIP architecture and data with only the new combined recipe and checking whether zero-shot accuracy, retrieval scores, and localization metrics rise by the claimed margins without trade-offs.

original abstract

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SigLIP 2, a family of multilingual vision-language encoders extending the original SigLIP image-text objective with captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation. The central claim is that this unified recipe yields consistent outperformance over SigLIP baselines at all scales (ViT-B to 1B) on zero-shot classification, image-text retrieval, and VLM transfer tasks, plus substantial gains on localization and dense prediction. Additional variants support multiple resolutions while preserving native aspect ratios, and a more diverse de-biased data mixture improves multilingual understanding and fairness. Checkpoints are released at four sizes.

Significance. If the empirical results hold with proper controls, the work would provide a stronger, practical baseline for vision-language pretraining by showing additive benefits from combining established techniques. Improvements in localization/dense features and multilingual fairness address real limitations in current encoders, and the multi-scale releases enable cost-performance trade-offs. The approach of unifying prior methods into a single recipe could influence subsequent training pipelines, though its value depends on whether gains are attributable to the recipe rather than uncontrolled factors such as total compute or data volume.

major comments (1)
  1. The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.
minor comments (2)
  1. Notation for the extended loss (captioning + self-supervised terms) should be defined explicitly, including weighting coefficients, to allow reproduction.
  2. Clarify how online data curation interacts with the de-biasing mixture; any overlap or filtering steps should be described to avoid ambiguity in the data pipeline.
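To make the first minor comment concrete, one possible form for the requested notation is sketched below. All symbols are assumptions introduced for illustration; this page does not state the paper's actual notation or coefficient values. Here the base sigmoid image-text loss is written as L_sig, and the captioning, self-distillation, and masked-prediction terms each carry a weighting coefficient.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{sig}}
  + \lambda_{\text{cap}} \, \mathcal{L}_{\text{cap}}
  + \lambda_{\text{sd}} \, \mathcal{L}_{\text{sd}}
  + \lambda_{\text{mp}} \, \mathcal{L}_{\text{mp}},
\qquad \lambda_{\text{cap}},\ \lambda_{\text{sd}},\ \lambda_{\text{mp}} \ge 0
```

Reporting the three coefficients, and any schedule on which they change during training, would be enough for a reader to reproduce the combined objective.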

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. We address the single major comment below and have prepared revisions to strengthen the presentation of our results.

point-by-point responses
  1. Referee: The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.

    Authors: We agree that the abstract, due to its length constraints, does not contain specific quantitative results, ablation details, or explicit statements on experimental controls. The full manuscript addresses these points through quantitative comparisons across multiple tables and figures, ablation studies in Section 4 that isolate the contribution of each added component (captioning, self-supervised losses, and data curation), and Section 3 which describes the training protocol with matched data volumes, step counts, and resolutions relative to the SigLIP baselines. To make this immediately visible, we will revise the abstract to include a small number of key performance deltas and a brief reference to the controlled experimental setup. These changes ensure the central claim can be evaluated without requiring the reader to consult the full text first.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical recipe evaluated on external benchmarks

full rationale

The paper describes an empirical training recipe that extends the prior SigLIP objective with captioning pretraining, self-supervised losses, and online curation, then reports performance gains on standard zero-shot, retrieval, VLM transfer, localization, and dense-prediction benchmarks. No equations, uniqueness theorems, or first-principles derivations are present that could reduce a claimed result to a fitted parameter or self-referential definition. Self-citations to the original SigLIP work serve only as the baseline for comparison and do not carry the load of proving the new gains; those gains are measured against held-out test sets. The argument is therefore self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

As an empirical scaling paper, the central claim rests on standard assumptions of deep learning optimization and data representativeness rather than new mathematical derivations.

free parameters (2)
  • loss weighting coefficients
    Weights balancing the original sigmoid image-text loss with added captioning and self-supervised terms are chosen during training.
  • data mixture proportions
    Proportions in the diverse multilingual data mixture including de-biasing are selected to achieve reported fairness gains.
axioms (2)
  • domain assumption: ViT-based encoder architecture behaves consistently under the added objectives
    The paper assumes the base SigLIP architecture scales without modification when new losses are introduced.
  • domain assumption: Online data curation selects representative samples without introducing selection bias
    Assumes the curation process improves quality without distorting the underlying data distribution.

pith-pipeline@v0.9.0 · 5571 in / 1351 out tokens · 69744 ms · 2026-05-10T15:44:04.883873+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  2. Representation Fréchet Loss for Visual Generation

    cs.CV 2026-04 unverdicted novelty 8.0

    Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-represe...

  3. Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.

  4. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  5. Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...

  6. CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

    cs.CV 2026-05 conditional novelty 7.0

    LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...

  7. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  8. VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

    cs.CV 2026-05 unverdicted novelty 7.0

    VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.

  9. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  10. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  11. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 7.0

    Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.

  12. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.

  13. Attention Transfer Is Not Universally Effective for Vision Transformers

    cs.CV 2026-05 accept novelty 7.0

    Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.

  14. Attributions All the Way Down? The Metagame of Interpretability

    cs.LG 2026-05 unverdicted novelty 7.0

    Defines meta-attributions as directional second-order Shapley values on attribution methods, proves hierarchical decomposition of attributions, and demonstrates applications in language models, vision-language encoder...

  15. OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

    cs.CV 2026-05 unverdicted novelty 7.0

    OpenGaFF combines a geometry-conditioned Gaussian Feature Field with codebook-guided attention to deliver more spatially coherent open-vocabulary 3D semantic segmentation than prior methods.

  16. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  17. Posterior Augmented Flow Matching

    cs.CV 2026-05 unverdicted novelty 7.0

    PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.

  18. Differentially Private Contrastive Learning via Bounding Group-level Contribution

    cs.CR 2026-04 unverdicted novelty 7.0

    DP-GCL improves differentially private contrastive learning by bounding group-level contributions through batch partitioning and intra-group augmentation, delivering 5.6% higher image classification accuracy and 20.1%...

  19. GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

    cs.CV 2026-04 unverdicted novelty 7.0

    GramSR uses DINOv3 visual features instead of text captions to condition a one-step diffusion model for super-resolution via sequential pixel, semantic, and texture LoRA modules.

  20. StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

    cs.GR 2026-04 unverdicted novelty 7.0

    StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.

  21. RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

    cs.CV 2026-04 unverdicted novelty 7.0

    RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.

  22. Evaluating Remote Sensing Image Captions Beyond Metric Biases

    cs.CV 2026-04 unverdicted novelty 7.0

    Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...

  23. Hybrid Latent Reasoning with Decoupled Policy Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

  24. Coevolving Representations in Joint Image-Feature Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...

  25. Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

    cs.CV 2026-04 unverdicted novelty 7.0

    Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.

  26. Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.

  27. UNIGEOCLIP: Unified Geospatial Contrastive Learning

    cs.CV 2026-04 unverdicted novelty 7.0

    UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...

  28. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  29. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  30. RewardFlow: Generate Images by Optimizing What You Reward

    cs.CV 2026-04 unverdicted novelty 7.0

    RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.

  31. InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

  32. Show Me the Infographic I Imagine: Intent-Aware Infographic Retrieval for Authoring Support

    cs.IR 2026-04 unverdicted novelty 7.0

    Presents a new retrieval system that enriches user queries with an intent taxonomy to improve matching of natural language descriptions to infographic designs and support authoring.

  33. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  34. A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

    cs.CV 2026-04 conditional novelty 7.0

    Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.

  35. StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    StyleTextGen proposes a dual-branch style encoder, text style consistency loss, and mask-guided inference to achieve superior style consistency and cross-lingual performance in multilingual scene text generation on a ...

  36. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  37. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  38. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  39. Unlocking UML Class Diagram Understanding in Vision Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A new UML class diagram VQA benchmark and 16k dataset enable LoRA fine-tuning to outperform Qwen 3.5 27B.

  40. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  41. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  42. Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...

  43. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  44. VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

    cs.SE 2026-05 unverdicted novelty 6.0

    VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...

  45. How Mobile World Model Guides GUI Agents?

    cs.AI 2026-05 unverdicted novelty 6.0

    Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...

  46. LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

    cs.CV 2026-05 unverdicted novelty 6.0

    LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...

  47. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 6.0

    GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...

  48. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  49. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.

  50. MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

    cs.CV 2026-05 unverdicted novelty 6.0

    MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.

  51. ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

    cs.CV 2026-05 unverdicted novelty 6.0

    ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier i...

  52. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  53. MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

    cs.SD 2026-05 accept novelty 6.0

    MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.

  54. Text-Conditional JEPA for Learning Semantically Rich Visual Representations

    cs.LG 2026-05 unverdicted novelty 6.0

    TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.

  55. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  56. Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video

    cs.CV 2026-05 unverdicted novelty 6.0

    A two-pass pipeline with Qwen3-VL-Plus and Gemini 3.1 Flash-Lite achieves 0.539 accuracy on the ACCIDENT@CVPR 2026 benchmark of 2,027 real CCTV videos for zero-shot temporal-spatial grounding of traffic events.

  57. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  58. End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

    cs.CV 2026-05 unverdicted novelty 6.0

    An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.

  59. Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

    cs.CV 2026-04 unverdicted novelty 6.0

    LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentatio...

  60. Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

    cs.CV 2026-04 conditional novelty 6.0

    CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted...

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 122 Pith papers · 8 internal anchors

  1. [1]

    Alabdulmohsin, X

    I. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer. Getting vit in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023

  2. [2]

    Alabdulmohsin, X

    I. Alabdulmohsin, X. Wang, A. P. Steiner, P. Goyal, A. D’Amour, and X. Zhai. Clip the bias: How useful is balancing data in multimodal learning? InICLR, 2024

  3. [3]

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023

  4. [4]

    Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and B. Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. NeurIPS, 2019

  5. [5]

    Are we done with imagenet?

    L. Beyer, O. J. Hénaff, A. Kolesnikov, X. Zhai, and A. v. d. Oord. Are we done with imagenet? arXiv:2006.07159, 2020

  6. [6]

    Flexivit: One model for all patch sizes

    L. Beyer, P. Izmailov, A. Kolesnikov, M. Caron, S. Kornblith, X. Zhai, M. Minderer, M. Tschannen, I. Alabdulmohsin, and F. Pavetic. Flexivit: One model for all patch sizes. In CVPR, 2023

  7. [7]

    PaliGemma: A versatile 3B VLM for transfer

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, ...

  8. [8]

    Coco-stuff: Thing and stuff classes in context

    H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, 2018

  9. [9]

    Emerging properties in self-supervised vision transformers

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In CVPR, pages 9650–9660, 2021

  10. [10]

    X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, A. Kolesnikov, J. Puigcerver, N. Ding, K. Rong, H. Akbari, G. Mishra, L. Xue, A. Thapliyal, J. Bradbury, W. Kuo, M. Seyedhosseini, C. Jia, B. K. Ayan, C. Riquelme, A. Steiner, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. PaLI: A j...

  11. [11]

    S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In CVPR, pages 4113–4123, 2024

  12. [12]

    Patch n’ pack: NaViT, a vision transformer for any aspect ratio and resolution

    M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdulmohsin, et al. Patch n’ pack: NaViT, a vision transformer for any aspect ratio and resolution. NeurIPS, 2024

  13. [13]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009

  14. [14]

    J. Ding, N. Xue, G.-S. Xia, and D. Dai. Decoupling zero-shot semantic segmentation. In CVPR, pages 11583–11592, 2022

  15. [15]

    An image is worth 16x16 words: Transformers for image recognition at scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  16. [16]

    Data curation via joint example selection further accelerates multimodal learning

    T. Evans, N. Parthasarathy, H. Merzic, and O. J. Hénaff. Data curation via joint example selection further accelerates multimodal learning. In NeurIPS Datasets and Benchmarks Track, 2024

  17. [17]

    The pascal visual object classes (voc) challenge

    M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010

  18. [18]

    L. Fan, D. Krishnan, P. Isola, D. Katabi, and Y. Tian. Improving clip training with language rewrites. NeurIPS, pages 35544–35575, 2023

  19. [19]

    A. Fang, A. M. Jose, A. Jain, L. Schmidt, A. T. Toshev, and V. Shankar. Data filtering networks. In ICLR, 2024

  20. [20]

    E. Fini, M. Shukor, X. Li, P. Dufter, M. Klein, D. Haldimann, S. Aitharaju, V. G. T. da Costa, L. Béthune, Z. Gan, A. T. Toshev, M. Eichner, M. Nabi, Y. Yang, J. M. Susskind, and A. El-Nouby. Multimodal autoregressive pre-training of large vision encoders. arXiv:2411.14402, 2024

  21. [21]

    S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. NeurIPS, 36, 2024

  22. [22]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team. Gemma: Open models based on gemini research and technology. arXiv:2403.08295, 2024

  23. [23]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118, 2024

  24. [24]

    Introduction to Cloud TPU

    Google Cloud. Introduction to Cloud TPU. https://cloud.google.com/tpu/docs/intro-to-tpu, 20xx. Accessed: 2024-07-04

  25. [25]

    Lvis: A dataset for large vocabulary instance segmentation

    A. Gupta, P. Dollar, and R. Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, pages 5356–5364, 2019

  26. [26]

    T.-Y. Hsu, C. L. Giles, and T.-H. Huang. Scicap: Generating captions for scientific figures. arXiv:2110.11624, 2021

  27. [27]

    OpenCLIP

    G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt. OpenCLIP, 2021

  28. [28]

    C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021

  29. [29]

    S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, Oct. 2014

  30. [30]

    W. Kuo, Y. Cui, X. Gu, A. Piergiovanni, and A. Angelova. Open-vocabulary object detection upon frozen vision and language models. In ICLR, 2023

  31. [31]

    Z. Lai, H. Zhang, B. Zhang, W. Wu, H. Bai, A. Timofeev, X. Du, Z. Gan, J. Shan, C.-N. Chuah, Y. Yang, and M. Cao. VeCLIP: Improving clip training via visual-enriched captions. arXiv:2310.07699, 2024

  32. [32]

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023

  33. [33]

    X. Li, Z. Wang, and C. Xie. Clipa-v2: Scaling clip training with 81.1% zero-shot imagenet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy. arXiv:2306.15658, 2023

  34. [34]

    T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. arXiv:1405.0312, 2014

  35. [35]

    H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023

  36. [36]

    S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. ICDAR 2023 competition on hierarchical text detection and recognition. In ICDAR, 2023

  37. [37]

    Decoupled Weight Decay Regularization

    I. Loshchilov, F. Hutter, et al. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 5, 2017

  38. [38]

    TIPS: Text-image pretraining with spatial awareness

    K.-K. Maninis, K. Chen, S. Ghosh, A. Karpur, K. Chen, Y. Xia, B. Cao, D. Salz, G. Han, J. Dlabal, et al. TIPS: Text-image pretraining with spatial awareness. In ICLR, 2025

  39. [39]

    Mm1: Methods, analysis & insights from multimodal llm pre-training

    B. McKinzie, Z. Gan, J. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, F. Weers, A. Belyi, H. Zhang, K. Singh, D. Kang, A. Jain, H. Hè, M. Schwarzer, T. Gunter, X. Kong, A. Zhang, J. Wang, C. Wang, N. Du, T. Lei, S. Wiseman, G. Yin, M. Lee, Z. Wang, R. Pang, P. Grasch, A. Toshev, and Y. Yang. MM1: methods, analysis & insights from mul...

  40. [40]

    Simple open-vocabulary object detection

    M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, et al. Simple open-vocabulary object detection. In ECCV, pages 728–755, 2022

  41. [41]

    Scaling open-vocabulary object detection

    M. Minderer, A. A. Gritsenko, and N. Houlsby. Scaling open-vocabulary object detection. In NeurIPS, 2023

  42. [42]

    Prioritized training on points that are learnable, worth learning, and not yet learnt

    S. Mindermann, J. M. Brauner, M. T. Razzak, M. Sharma, A. Kirsch, W. Xu, B. Höltgen, A. N. Gomez, A. Morisot, S. Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In ICML, pages 15630–15649, 2022

  43. [43]

    The role of context for object detection and semantic segmentation in the wild

    R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014

  44. [44]

    N. Mu, A. Kirillov, D. Wagner, and S. Xie. SLIP: Self-supervision meets language-image pre-training. In ECCV, pages 529–544, 2022

  45. [45]

    M. F. Naeem, Y. Xian, X. Zhai, L. Hoyer, L. Van Gool, and F. Tombari. SILC: Improving vision language pretraining with self-distillation. In ECCV, pages 38–55, 2024

  46. [46]

    Improving multimodal datasets with image captioning

    T. Nguyen, S. Y. Gadre, G. Ilharco, S. Oh, and L. Schmidt. Improving multimodal datasets with image captioning. NeurIPS, 36, 2024

  47. [47]

    Dinov2: Learning robust visual features without supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2024

  48. [48]

    Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023

  49. [49]

    No filter: Cultural and socioeconomic diversity in contrastive vision-language models

    A. Pouget, L. Beyer, E. Bugliarello, X. Wang, A. P. Steiner, X. Zhai, and I. Alabdulmohsin. No filter: Cultural and socioeconomic diversity in contrastive vision-language models. arXiv:2405.13777, 2024

  50. [50]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021

  51. [51]

    V. V. Ramaswamy, S. Y. Lin, D. Zhao, A. Adcock, L. van der Maaten, D. Ghadiyaram, and O. Russakovsky. Geode: a geographically diverse evaluation dataset for object recognition. NeurIPS, 36, 2024

  52. [52]

    Vision transformers for dense prediction

    R. Ranftl, A. Bochkovskiy, and V. Koltun. Vision transformers for dense prediction. In CVPR, pages 12179–12188, 2021

  53. [53]

    Do imagenet classifiers generalize to imagenet?

    B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do imagenet classifiers generalize to imagenet? In ICML, pages 5389–5400, 2019

  54. [54]

    W. A. G. Rojas, S. Diamos, K. R. Kini, D. Kanter, V. J. Reddi, and C. Coleman. The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world. In NeurIPS Datasets and Benchmarks Track, 2022

  55. [55]

    TextCaps: A dataset for image captioning with reading comprehension

    O. Sidorov, R. Hu, M. Rohrbach, and A. Singh. TextCaps: A dataset for image captioning with reading comprehension. In ECCV, 2020

  56. [56]

    A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y. Bitton, A. Gritsenko, M. Minderer, A. Sherbondy, S. Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv:2412.03555, 2024

  57. [57]

    Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao. EVA-CLIP: Improved training techniques for clip at scale. arXiv:2303.15389, 2023

  58. [58]

    A. V. Thapliyal, J. Pont Tuset, X. Chen, and R. Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In EMNLP, 2022

  59. [59]

    S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, A. Wang, R. Fergus, Y. LeCun, and S. Xie. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv:2406.16860, 2024

  60. [60]

    Image captioners are scalable vision learners too

    M. Tschannen, M. Kumar, A. Steiner, X. Zhai, N. Houlsby, and L. Beyer. Image captioners are scalable vision learners too. In NeurIPS, 2023

  61. [61]

    Active data curation effectively distills large-scale multimodal models

    V. Udandarao, N. Parthasarathy, M. F. Naeem, T. Evans, S. Albanie, F. Tombari, Y. Xian, A. Tonioni, and O. J. Hénaff. Active data curation effectively distills large-scale multimodal models. arXiv:2411.18674, 2024

  62. [62]

    B. Wan, M. Tschannen, Y. Xian, F. Pavetic, I. Alabdulmohsin, X. Wang, A. S. Pinto, A. Steiner, L. Beyer, and X. Zhai. LocCa: Visual pretraining with location-aware captioners. In NeurIPS, 2024

  63. [63]

    B. Wang, G. Li, X. Zhou, Z. Chen, T. Grossman, and Y. Li. Screen2words: Automatic mobile ui summarization with multimodal learning. In Symposium on User Interface Software and Technology, 2021

  64. [64]

    Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao. SimVLM: Simple visual language model pretraining with weak supervision. In ICLR, 2022

  65. [65]

    Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval

    T. Weyand, A. Araujo, B. Cao, and J. Sim. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. In CVPR, pages 2575–2584, 2020

  66. [66]

    H. Xu, S. Xie, X. Tan, P.-Y. Huang, R. Howes, V. Sharma, S.-W. Li, G. Ghosh, L. Zettlemoyer, and C. Feichtenhofer. Demystifying clip data. In ICLR, 2024

  67. [67]

    J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu. CoCa: Contrastive captioners are image-text foundation models. TMLR, 2022

  68. [68]

    L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, pages 69–85, 2016

  69. [69]

    X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. CVPR, 2022

  70. [70]

    X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer. Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022

  71. [71]

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023

  72. [72]

    Y. Zhao, A. Gu, R. Varma, L. Luo, C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li. Pytorch FSDP: experiences on scaling fully sharded data parallel. VLDB, 2023

  73. [73]

    B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017

  74. [74]

    B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019

  75. [75]

    J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. Image BERT pre-training with online tokenizer. In ICLR, 2022