DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
super hub Mixed citations
DINOv2: Learning Robust Visual Features without Supervision
Mixed citation behavior. Most common role is background (44%).
abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques
authors
co-cited works
representative citing papers
WildBox provides over 237k 3D wildlife annotations from drone video and benchmarks reveal zero-shot 3D detection at 0 AP but fine-tuned performance of 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.
X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.
InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.
The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.
A systematic survey unifies presentation, digital injection, and GenAI synthesis attacks on identity documents, audits datasets for a reality gap, identifies SDGI in multimodal models, and reports APCER above 25% for top models on synthetic IDs.
A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.
EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.
GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.
WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.
AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.
A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.
First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
citing papers explorer
-
DataComp-VLM: Improved Open Datasets for Vision-Language Models
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
-
WildBox: A Dataset and Benchmark for Aerial Monocular 3D Detection of African Savanna Wildlife
WildBox provides over 237k 3D wildlife annotations from drone video and benchmarks reveal zero-shot 3D detection at 0 AP but fine-tuned performance of 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.
-
X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication
X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.
-
Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
-
CalibAnyView: Beyond Single-View Camera Calibration in the Wild
A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
-
Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.
-
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
-
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
-
Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
-
Show Me Examples: Inferring Visual Concepts from Image Sets
Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.
-
InvSplat: Inverse Feed-Forward Scene Splatting
InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.
-
Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention
The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.
-
From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attack and Detection
A systematic survey unifies presentation, digital injection, and GenAI synthesis attacks on identity documents, audits datasets for a reality gap, identifies SDGI in multimodal models, and reports APCER above 25% for top models on synthetic IDs.
-
Prototype Memory-Guided Training-Free Anomaly Classification and Localization in Prenatal Ultrasound
A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.
-
EPO: Boosting 3D Foundation Models with Edge-based Pose Optimization
EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.
-
GEAR: Guided End-to-End AutoRegression for Image Synthesis
GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.
-
WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis
WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.
-
AnyMatch: Supercharging Universal Multi-Modal Image Matching with Large-Scale Single-View Images
AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.
-
Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization
A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.
-
Complete virtual unwrapping and reading of a rolled Herculaneum papyrus
First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.
-
Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
-
A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$
Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.
-
Learning 1-Bit LiDAR-based Localization with Auxiliary Objective
BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.
-
Scene and Human in One World: Reconstruction in a Feedforward Pass
SHOW is a mask-promptable framework coupling feed-forward scene reconstruction with human mesh recovery in a unified metric space to resolve scale ambiguity and improve human-scene alignment from monocular video.
-
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear in dynamic environments.
-
Bridging Vision and Language Concepts through Optimal Transport Semantic Flow
OTF-CBM replaces static cosine similarity in vision-language CBMs with data-driven optimal transport flow to improve concept alignment, accuracy, and faithfulness.
-
MIRAGE: Protecting against Malicious Image Editing via False Moderation
MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.
-
C3-Bench: A Context-Aware Change Captioning Benchmark
C3-Bench supplies a multi-domain dataset and LLM-based evaluation protocol that exposes systematic failures in existing change captioning models outside their training regimes.
-
GRAFT: Graph-Based Affordance Transfer via Part Correspondence
GRAFT transfers manipulation contact points to unseen objects via part-graph retrieval and correspondence from a single demonstration.
-
Rethinking Prototype-based Similarity Learning for Few-Shot Object Detection
Introduces TSMa using text-visual channel interaction and SHARe using ViT layer-aligned autoregressive regression to improve prototype-based few-shot object detection, reporting +10.1 nAP on COCO.
-
HiMatch-AD: DINOv3-driven Hierarchical Matching for Training-free Medical Anomaly Detection
HiMatch-AD proposes DINOv3-driven hierarchical matching with uncertainty-based fusion for training-free medical anomaly detection and reports outperformance on the BMAD benchmark.
-
GroundShot: Visually Consistent Multi-Shot Long Video Generation via Entity-Grounded Shot Scheduling
GroundShot introduces entity-grounded shot scheduling with online visual memory to improve consistency in multi-shot video generation and presents GroundBench for entity-level evaluation.
-
Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale
The normalized inverse-scale direction of LayerNorm's affine parameters is an exact algebraic kernel of the post-final-norm centred activation covariance for any input distribution in LayerNorm transformers.
-
Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy
A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.
-
PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space
PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.
-
Human Universal Grasping
HUG trains a flow-matching model on a new 1M-frame egocentric human grasp dataset to generate retargetable grasps from single RGB-D images, beating baselines by 23-34% on a new 90-object benchmark.
-
Open-World Video Segmentation
Savvy is a practical system for zero-shot open-world long-horizon video segmentation paired with the OGA granularity-aware evaluation protocol that uses n:1 matching and sever-point detection.
-
Pano3D: Unified 3D Reconstruction and Panoptic Segmentation
Pano3D augments 3D feedforward reconstruction backbones with a set-based mask decoder and joint geometric-semantic training to achieve SOTA 3D panoptic segmentation on ScanNet, ScanNet200, and ScanNet++.
-
CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation
CineOrchestra unifies control of subjects, events, cameras, and shot transitions in cinematic video generation through entity-centric conditioning primitives and parameter-free coordinated rotary embeddings.
-
JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications
JSCGC replaces the conventional decoder with a generative model for controlled sampling, reformulating communication as mutual information maximization under perceptual constraints and showing improved semantic and distributional quality in image transmission experiments.
-
DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images
DepthMaster unifies metric monocular depth estimation for perspective and panoramic images by patching panoramas into perspective views, adding a consistency loss and virtual cameras, and training mostly on perspective data to reach SOTA zero-shot results on 13 datasets.
-
SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models
SSR-Merge merges LoRAs via subspace construction, inverse correlation decorrelation, and directional steering, shown to match the OLS solution with a streaming implementation that outperforms prior merging methods.
-
Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning
Rea2Seg turns image segmentation into candidate mask discovery from MLLM attention followed by MLLM-based comparative scoring and selection, plus a new multi-dimensional reasoning benchmark ReasonSeg-SGDR.
-
DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection
DRIFT learns a structured invariance manifold from real images via one-class supervision on decomposed robust and fragile subspaces of a frozen VFM to detect AI-generated images through margin violations.
-
CoFi-UCGen: Coarse-to-Fine Unsupervised Conditional Generation without Label Priors
CoFi-UCGen achieves both coarse- and fine-grained unsupervised conditional image generation by using bit-codes for structured latent space and hierarchical modulation in diffusion models.
-
Balancing Image Compression and Generation with Bootstrapped Tokenization
SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.
-
Imagine Before You Draw: Visual Prompt Engineering for Image Generation
VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.