super hub Mixed citations

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Alexey Gritsenko, Ibrahim Alabdulmohsin, Michael Tschannen, Muhammad Ferjad Naeem, Nikhil Parthasarathy, Xiao Wang · 2025 · cs.CV · arXiv 2502.14786

Mixed citation behavior. Most common role is background (57%).

317 Pith papers citing it

Background 57% of classified citations

open full Pith review browse 317 citing papers more from Alexey Gritsenko arXiv PDF

abstract

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 40 method 23 baseline 3 dataset 1

citation-polarity summary

background 38 use method 23 baseline 3 unclear 2 use dataset 1

claims ledger

abstract We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and trans

authors

Alexey Gritsenko Ibrahim Alabdulmohsin Michael Tschannen Muhammad Ferjad Naeem Nikhil Parthasarathy Xiao Wang

co-cited works

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Is Dimensionality a Barrier for Retrieval Models?

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

Representation Fr\'echet Loss for Visual Generation

cs.CV · 2026-04-30 · unverdicted · novelty 8.0

Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-representation metric FDr^k.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

DMV-Bench introduces the first interactive benchmark for multimodal-agent visual memory via incidental cue injection on product images, and DualMem, a parallel visual-verbal memory architecture, outperforms baselines across chain lengths 5-50 on two VLMs.

OctoSense: Self-Supervised Learning for Multimodal Robot Perception

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

OctoSense supplies a large multimodal robotics dataset and a late-fusion masked autoencoder that runs fast and outperforms image-only models on optical flow, depth, segmentation, and ego-motion tasks while remaining robust under sensor degradation.

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

cs.CV · 2026-06-22 · conditional · novelty 7.0

HeRA aligns least-aligned attention heads in MLLMs using an MKNN-based contrastive objective to preserve cross-modal topological structure, yielding gains on vision-centric tasks and reduced hallucinations across 18 benchmarks.

SemCEB: A Cardinality Estimation Benchmark for Semantic Operators

cs.DB · 2026-06-22 · unverdicted · novelty 7.0

SemCEB is the first benchmark for cardinality estimation over semantic operators, evaluating sampling methods and Semantic Histograms on accuracy, cost, latency, and memory using 102 queries on a real-world dataset.

NEST: Narrative Event Structures in Time for Long Video Understanding

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

NEST is a new benchmark dataset for narrative event structures in long videos, with baselines reporting ETD below 8%, EL under 6%, EAE below 11%, and ERE at 35-44% F1.

Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

The normalized inverse-scale direction of LayerNorm's affine parameters is an exact algebraic kernel of the post-final-norm centred activation covariance for any input distribution in LayerNorm transformers.

FARM: Find Anything using Relational Spatial Memory

cs.RO · 2026-06-13 · unverdicted · novelty 7.0

FARM creates an open-vocabulary relational spatial memory that improves object retrieval recall by 164-224% over prior methods on 44k language queries across 67 scenes while running at 5-10 Hz.

Balancing Image Compression and Generation with Bootstrapped Tokenization

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.

Beyond Symmetric Alignment: Spectral Diagnostics of Modality Imbalance in Vision-Language Models in the Medical Domain

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

SAS reveals that medical images retain richer structural information than paired clinical reports in VLMs, an asymmetry hidden from symmetric metrics, with strongest correlation to retrieval performance.

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.

Benchmarking Visual State Tracking in Multimodal Video Understanding

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

TrAction: Action Recognition with Sparse Trajectories

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Sparse 2.5D trajectory transformers with masked pretraining reach 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens while improving fusion with DINOv2 and V-JEPA by up to 8.7 points.

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.

HiTokSR: A Coarse-to-Fine Tokenizer with Hierarchical Codebooks for High-Fidelity Real-World Image Super-Resolution

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

HiTokSR uses a coarse-to-fine hierarchical tokenizer with frequency-aware sub-codebooks, vision foundation model priors, and index perturbation to achieve state-of-the-art perceptual quality and fidelity in real-world image super-resolution.

HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

cs.CV · 2026-05-31 · accept · novelty 7.0

HakushoBench provides 2,053 Japanese chart and table images from governmental white papers with QA pairs, showing open-weight VLMs reach only 58.6% accuracy versus higher proprietary performance.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

citing papers explorer

Showing 16 of 16 citing papers after filters.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 139 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 62 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Personalizing Text-to-Image Generation to Individual Taste cs.CV · 2026-04-08 · unverdicted · none · ref 52 · internal anchor
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
Cambrian-P: Pose-Grounded Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 93 · internal anchor
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs? cs.CV · 2026-05-09 · unverdicted · none · ref 40 · internal anchor
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction cs.CV · 2026-04-13 · unverdicted · none · ref 67 · internal anchor
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models cs.CV · 2026-05-07 · unverdicted · none · ref 61 · internal anchor
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision cs.CV · 2026-05-07 · unverdicted · none · ref 51 · internal anchor
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection cs.CV · 2026-04-23 · unverdicted · none · ref 56 · internal anchor
UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
Gated Memory Policy cs.RO · 2026-04-21 · unverdicted · none · ref 50 · internal anchor
GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System cs.CV · 2026-04-15 · unverdicted · none · ref 38 · 2 links · internal anchor
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
How Mobile World Model Guides GUI Agents? cs.AI · 2026-05-11 · unverdicted · none · ref 23 · 2 links · internal anchor
World models trained on delta text, full text, diffusion images, and renderable code achieve SoTA on two benchmarks and improve downstream GUI agent performance on three mobile datasets with modality-specific strengths.
A Real-Calibrated Synthetic-First Data Engine eess.IV · 2026-05-10 · unverdicted · none · ref 25 · internal anchor
A data curation pipeline using diffusion-generated synthetic images improves pose estimation when added to real data but underperforms when used without real anchors.
HyLaR: Hybrid Latent Reasoning with Decoupled Policy Optimization cs.CV · 2026-04-22 · unreviewed · ref 30 · internal anchor
OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation cs.CV · 2026-04-20 · unreviewed · ref 44 · internal anchor
AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly cs.RO · 2026-04-10 · unreviewed · ref 43 · internal anchor

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer