super hub Mixed citations

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Mixed citation behavior. Most common role is background (57%).

317 Pith papers citing it
Background 57% of classified citations
abstract

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

citation-role summary

background 40 · method 23 · baseline 3 · dataset 1

representative citing papers

Is Dimensionality a Barrier for Retrieval Models?

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.

Representation Fréchet Loss for Visual Generation

cs.CV · 2026-04-30 · unverdicted · novelty 8.0

Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-representation metric FDr^k.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

OctoSense: Self-Supervised Learning for Multimodal Robot Perception

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

OctoSense supplies a large multimodal robotics dataset and a late-fusion masked autoencoder that runs fast and outperforms image-only models on optical flow, depth, segmentation, and ego-motion tasks while remaining robust under sensor degradation.

SemCEB: A Cardinality Estimation Benchmark for Semantic Operators

cs.DB · 2026-06-22 · unverdicted · novelty 7.0

SemCEB is the first benchmark for cardinality estimation over semantic operators, evaluating sampling methods and Semantic Histograms on accuracy, cost, latency, and memory using 102 queries on a real-world dataset.

FARM: Find Anything using Relational Spatial Memory

cs.RO · 2026-06-13 · unverdicted · novelty 7.0

FARM creates an open-vocabulary relational spatial memory that improves object retrieval recall by 164-224% over prior methods on 44k language queries across 67 scenes while running at 5-10 Hz.

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.

TrAction: Action Recognition with Sparse Trajectories

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Sparse 2.5D trajectory transformers with masked pretraining reach 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens while improving fusion with DINOv2 and V-JEPA by up to 8.7 points.

citing papers explorer

Showing 16 of 16 citing papers after filters.