DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
super hub Mixed citations
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Mixed citation behavior. Most common role is background (57%).
abstract
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and trans
authors
co-cited works
representative citing papers
Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-representation metric FDr^k.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.
DMV-Bench introduces the first interactive benchmark for multimodal-agent visual memory via incidental cue injection on product images, and DualMem, a parallel visual-verbal memory architecture, outperforms baselines across chain lengths 5-50 on two VLMs.
The normalized inverse-scale direction of LayerNorm's affine parameters is an exact algebraic kernel of the post-final-norm centred activation covariance for any input distribution in LayerNorm transformers.
FARM creates an open-vocabulary relational spatial memory that improves object retrieval recall by 164-224% over prior methods on 44k language queries across 67 scenes while running at 5-10 Hz.
SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.
SAS reveals that medical images retain richer structural information than paired clinical reports in VLMs, an asymmetry hidden from symmetric metrics, with strongest correlation to retrieval performance.
VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
Sparse 2.5D trajectory transformers with masked pretraining reach 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens while improving fusion with DINOv2 and V-JEPA by up to 8.7 points.
X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.
HiTokSR uses a coarse-to-fine hierarchical tokenizer with frequency-aware sub-codebooks, vision foundation model priors, and index perturbation to achieve state-of-the-art perceptual quality and fidelity in real-world image super-resolution.
HakushoBench provides 2,053 Japanese chart and table images from governmental white papers with QA pairs, showing open-weight VLMs reach only 58.6% accuracy versus higher proprietary performance.
Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.
3DCodeBench is a new benchmark evaluating 12 VLMs on translating multimodal prompts into procedural 3D modeling code, paired with 3DCodeArena for human preference rankings.
SVI-Bench provides 35K hours of sports video with 9 tasks across four cognitive levels, revealing models drop from ~74% on action QA to 5% on agentic evidence integration.
PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.
PRISM replaces Markov or fixed-window intention models in multi-intention IRL with a recurrent network, proving an exact EM decomposition into closed-form per-intention reward problems and reporting highest held-out likelihood on gridworld, mouse, and robotic tasks.
citing papers explorer
-
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
-
MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
-
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.
-
SoccerMaster: A Vision Foundation Model for Soccer Understanding
SoccerMaster is the first soccer-specific vision foundation model that unifies tasks from player detection to event classification via multi-task pretraining and outperforms task-specific models on downstream evaluations.
-
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.
-
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.
-
CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?
CardioBench is a new public benchmark that standardizes eight echocardiography datasets into four regression and five classification tasks to evaluate foundation model generalization.
-
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
-
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
-
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
A contrastive multimodal framework augments satellite-audio datasets with vision-language model sound descriptions to learn shared soundscape concepts for zero-shot retrieval and synthesis.
-
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
-
SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models
SigLino distills SigLIP2 and DINOv3 into efficient vision models via asymmetric relation-knowledge distillation, token-balanced batching, and hierarchical data sampling on a new 200M-image corpus, yielding better transfer to grounding VLMs than training from scratch.
-
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Chorus pretrains a shared 3D Gaussian scene encoder via multi-teacher distillation to capture holistic features from high-level semantics to fine-grained structure, with strong transfer on segmentation and point-cloud tasks using far fewer scenes.
-
PhotoFramer: Multi-modal Image Composition Instruction
PhotoFramer is a multi-modal model that jointly produces textual composition instructions and illustrative corrected images from poorly framed inputs.
-
RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
RADSeg adapts the RADIO model with targeted enhancements to deliver 6-30% higher mIoU in zero-shot OVSS while using 2.5x fewer parameters and running 3.95x faster than prior large-model combinations.
-
Cambrian-S: Towards Spatial Supersensing in Video
Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise outperforms baselines on the new spatial supersensing tasks.
-
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
Alpamayo-R1 introduces a VLA model with a Chain of Causation dataset and multi-stage SFT-plus-RL training that reports 12% better planning accuracy and 35% fewer close encounters versus trajectory-only baselines in driving tasks.
-
VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
VFM-VAE uses a frozen VFM directly as LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
-
Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering
PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.
-
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
-
ImgEdit: A Unified Image Editing Dataset and Benchmark
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
-
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
ViTaPEs uses two-stage positional encodings in a multimodal transformer to learn task-agnostic visuotactile representations that outperform baselines on recognition tasks, show zero-shot generalization, and improve robotic grasp success prediction.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
FLARE: Robot Learning with Implicit World Modeling
FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.
-
Perception Encoder: The best visual embeddings are not at the output of the network
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
-
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
LLaVA-OneVision-1.5 provides open datasets, code, and models that match or exceed closed competitors on 27 benchmarks at low cost through curated data and efficient training.
-
Qwen-Image Technical Report
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Zero-shot VLMs reach at most 62% accuracy on agricultural classification tasks while supervised models like YOLO11 perform markedly higher, indicating they are not ready to replace task-specific systems.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
-
Multilingual Vision-Language Models, A Survey
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.
- Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer