pith. machine review for the scientific record.

arxiv: 2508.10104 · v1 · submitted 2025-08-13 · 💻 cs.CV · cs.LG

Recognition: unknown

DINOv3

Authors on Pith: no claims yet
classification 💻 cs.CV cs.LG
keywords: dinov3, models, vision, data, model, tasks, dense, diverse
0 comments
abstract

Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
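The Gram anchoring method described in the abstract ties the pairwise patch-similarity (Gram) matrix of the model being trained to that of an earlier checkpoint, stabilizing dense features over long schedules. A minimal numpy sketch of that idea, assuming a Frobenius-style distance between cosine-similarity Gram matrices; the function names and the exact loss form are illustrative, not the paper's implementation:

```python
import numpy as np

def gram_matrix(feats: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between patch features.

    feats: (num_patches, dim) array of patch embeddings.
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T  # (num_patches, num_patches)

def gram_anchoring_loss(student: np.ndarray, anchor: np.ndarray) -> float:
    """Mean squared difference between the student's Gram matrix and
    that of a frozen earlier checkpoint (a 'Gram teacher')."""
    diff = gram_matrix(student) - gram_matrix(anchor)
    return float(np.mean(diff ** 2))
```

Because the loss constrains only pairwise similarities between patches, not the features themselves, the model remains free to improve global representations while the dense similarity structure is anchored.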

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

    cs.CV 2026-05 unverdicted novelty 7.0

    PROVE proposes RC metrics for perceptual removal coherence and releases PROVE-Bench to better align automatic scores with human judgments on object removal tasks.

  2. LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction

    cs.RO 2026-05 unverdicted novelty 7.0

    LEXI-SG is the first monocular RGB system for dense open-vocabulary 3D scene graphs that partitions scenes into rooms and performs feed-forward reconstruction per room before global factor-graph alignment.

  3. Local Conformal Calibration of Dynamics Uncertainty from Semantic Images

    cs.RO 2026-05 unverdicted novelty 7.0

    OCULAR calibrates dynamics uncertainty using perception from similar environments to give guaranteed prediction regions for unseen test conditions.

  4. AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

    cs.CV 2026-05 unverdicted novelty 7.0

    AssemblyBench dataset and AssemblyDyno transformer model enable physics-aware prediction of assembly sequences and trajectories for complex industrial objects from multimodal instructions and 3D shapes.

  5. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  6. VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

    cs.CV 2026-05 unverdicted novelty 7.0

    VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.

  7. DriftXpress: Faster Drifting Models via Projected RKHS Fields

    cs.LG 2026-05 unverdicted novelty 7.0

    DriftXpress approximates drifting kernels via projected RKHS fields to lower training cost of one-step generative models while matching original FID scores.

  8. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  9. Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    AVA-DINO uses dual anomaly-aware branches with text-guided routing and regularization on DINOv3 features to achieve state-of-the-art zero-shot anomaly detection on industrial and medical benchmarks.

  10. H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes

    cs.CV 2026-05 unverdicted novelty 7.0

    H2G distills 2D foundation-model affinities into a Lorentz hyperbolic feature field that represents hierarchical 3D groupings at multiple granularities.

  11. Revisiting Shadow Detection from a Vision-Language Perspective

    cs.CV 2026-05 unverdicted novelty 7.0

    SVL uses language embeddings aligned with global image representations via shadow ratio regression and global-to-local coupling to improve shadow detection robustness in ambiguous cases.

  12. 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

  13. MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

    cs.LG 2026-05 unverdicted novelty 7.0

    MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.

  14. Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution

    cs.CV 2026-05 unverdicted novelty 7.0

    The paper provides the first theoretical analysis of multi-modal super-resolution and proposes M³ESR, a mixture-of-experts framework with spatially dynamic and temporally adaptive modality weighting that improves gene...

  15. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

    cs.AI 2026-05 unverdicted novelty 7.0

    Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.

  16. NeuralBench: A Unifying Framework to Benchmark NeuroAI Models

    cs.LG 2026-05 conditional novelty 7.0

    NeuralBench is a new benchmarking framework for neuroAI models on EEG data that finds foundation models only marginally outperform task-specific ones while many tasks like cognitive decoding stay highly challenging.

  17. Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    Embedding Gaussian primitives into a ray tracing structure enables unified radio propagation simulation and view synthesis from visual-only reconstructions.

  18. Velocity-Space 3D Asset Editing

    cs.GR 2026-05 unverdicted novelty 7.0

    VS3D performs local 3D asset editing by injecting reconstruction-anchored source signals, partial-mean guidance, and twin-agreement residuals into the velocity sampler to control edit strength and preserve identity.

  19. Attention Transfer Is Not Universally Effective for Vision Transformers

    cs.CV 2026-05 accept novelty 7.0

    Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.

  20. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  21. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  22. Autoregressive Visual Generation Needs a Prologue

    cs.CV 2026-05 unverdicted novelty 7.0

    Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.

  23. SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

    cs.CV 2026-05 unverdicted novelty 7.0

    SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection...

  24. Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence

    cs.CV 2026-05 unverdicted novelty 7.0

    MOCHI enables registration-free training of multi-view 3D face reconstruction by enforcing topological consistency via a pseudo-linear inverse kinematic solver, using synthetic-data-trained 2D landmarks for alignment,...

  25. AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

    cs.CV 2026-04 unverdicted novelty 7.0

    AEGIS benchmark reveals that leading AI models achieve only 48.80% overall accuracy and low localization precision when analyzing AI-generated academic images, exposing gaps between generative and forensic capabilities.

  26. TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

    cs.CV 2026-04 unverdicted novelty 7.0

    A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.

  27. Towards All-Day Perception for Off-Road Driving: A Large-Scale Multispectral Dataset and Comprehensive Benchmark

    cs.CV 2026-04 unverdicted novelty 7.0

    Presents the first large-scale infrared off-road dataset and a flow-free temporal model achieving state-of-the-art freespace detection performance with real-time inference.

  28. GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

    cs.CV 2026-04 unverdicted novelty 7.0

    GramSR uses DINOv3 visual features instead of text captions to condition a one-step diffusion model for super-resolution via sequential pixel, semantic, and texture LoRA modules.

  29. OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.

  30. A satellite foundation model for improved wealth monitoring

    cs.CY 2026-04 unverdicted novelty 7.0

    Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and genera...

  31. Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the diagnosis-driven CE video summarization task, the VideoCAP dataset with 240 annotated videos, and the DiCE framework that outperforms prior methods by screening candidates then weaving them into diagnos...

  32. Instance-level Visual Active Tracking with Occlusion-Aware Planning

    cs.CV 2026-04 unverdicted novelty 7.0

    OA-VAT improves visual active tracking by combining instance-level prototype discrimination with occlusion-aware diffusion planning, reporting gains over prior SOTA on simulated and real drone benchmarks.

  33. Evaluating Remote Sensing Image Captions Beyond Metric Biases

    cs.CV 2026-04 unverdicted novelty 7.0

    Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...

  34. Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model

    cs.CV 2026-04 unverdicted novelty 7.0

    Distillation from visual foundation models to lidar enables frame-wise indoor semantic segmentation without manual annotations, achieving up to 56% mIoU on pseudo labels and 36% on real labels.

  35. Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras

    cs.CV 2026-04 unverdicted novelty 7.0

    A single attention-based model trained on synthetic wide-baseline event data achieves zero-shot feature matching across unseen datasets with a reported 37.7% improvement over prior event matching methods.

  36. DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    DifFoundMAD improves differential morphing attack detection by replacing traditional embeddings with those from vision foundation models and applying class-balanced lightweight fine-tuning, cutting high-security error...

  37. View-Consistent 3D Scene Editing via Dual-Path Structural Correspondence and Semantic Continuity

    cs.CV 2026-04 unverdicted novelty 7.0

    A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.

  38. Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

    cs.AI 2026-04 unverdicted novelty 7.0

    Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...

  39. OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniGCD trains a Transformer once on synthetic data to enable zero-shot generalized category discovery across 16 datasets in four modalities without any dataset-specific fine-tuning.

  40. Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

    cs.MM 2026-04 unverdicted novelty 7.0

    Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...

  41. Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

    cs.CV 2026-04 accept novelty 7.0

    Zero-ablation overstates register content dependence in DINO ViTs because mean, noise, and cross-image shuffle replacements preserve performance while zeroing does not.

  42. UNIGEOCLIP: Unified Geospatial Contrastive Learning

    cs.CV 2026-04 unverdicted novelty 7.0

    UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...

  43. Development and evaluation of CADe systems in low-prevalence setting: The RARE25 challenge for early detection of Barrett's neoplasia

    cs.CV 2026-04 unverdicted novelty 7.0

    The RARE25 challenge finds that CADe systems achieve strong discrimination on Barrett's neoplasia but low positive predictive values under realistic low prevalence, with all 11 submissions relying on fully supervised ...

  44. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  45. A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics

    cond-mat.dis-nn 2026-04 unverdicted novelty 7.0

    A minimal embedding model shows representation collapse arises from frustrated samples through slow dynamics and is prevented by stop-gradient.

  46. CAD 100K: A Comprehensive Multi-Task Dataset for Car Related Visual Anomaly Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    CAD 100K is the first comprehensive multi-task dataset for car-related visual anomaly detection, spanning 7 domains and 3 tasks with synthetic augmentation for few-shot cases.

  47. RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data

    cs.CV 2026-04 unverdicted novelty 7.0

    RS-OVC is the first open-vocabulary counting model for remote-sensing imagery that enables accurate counts of novel object classes unseen during training via textual or visual conditioning.

  48. SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion

    cs.LG 2026-04 unverdicted novelty 7.0

    SPAMoE reduces average MAE by 44.4% on OpenFWI datasets for full-waveform inversion via a spectral-preserving DINO encoder and dynamic frequency-band routing to specialized neural operators.

  49. From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal

    cs.CV 2026-04 conditional novelty 7.0

    Visual encoders leak identity information; a one-shot linear subspace removal method (ISP) reduces leakage to near-chance levels while retaining high non-biometric utility across datasets.

  50. A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

    cs.CV 2026-04 conditional novelty 7.0

    Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.

  51. Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification

    cs.CV 2026-04 unverdicted novelty 7.0

    A new diagnostic framework using inpainted context ratios and laterality checks on a Pantanal jaguar benchmark reveals whether re-ID models depend on coat patterns or spurious background evidence.

  52. InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset

    cs.CV 2026-04 unverdicted novelty 7.0

    InCaRPose is a Transformer-based model trained on synthetic data that predicts absolute metric-scale relative poses between distorted in-cabin camera views and generalizes to real images while releasing a new test dataset.

  53. Beauty in the Eye of AI: Aligning LLMs and Vision Models with Human Aesthetics in Network Visualization

    cs.LG 2026-04 conditional novelty 7.0

    LLMs and vision models achieve human-human alignment levels in judging network visualization aesthetics through prompt engineering on a new dataset of human preferences from 27 participants.

  54. VOSR: A Vision-Only Generative Model for Image Super-Resolution

    cs.CV 2026-04 conditional novelty 7.0

    VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a r...

  55. SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    SPG uses sparse autoencoders to learn guide coefficients that generate normal and anomalous reference vectors, achieving competitive zero-shot anomaly detection and strong segmentation on MVTec AD and VisA without tar...

  56. Generalized Small Object Detection: A Point-Prompted Paradigm and Benchmark

    cs.CV 2026-04 unverdicted novelty 7.0

    TinySet-9M dataset and DEAL point-prompted framework deliver 31.4% relative AP75 gain over supervised baselines for small object detection with one click at inference and generalization to unseen categories.

  57. Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

    cs.CV 2026-04 unverdicted novelty 7.0

    Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garm...

  58. Satellite-Free Training for Drone-View Geo-Localization

    cs.CV 2026-04 conditional novelty 7.0

    A satellite-free training framework reconstructs 3D drone scenes via Gaussian splatting, generates geometry-normalized pseudo-orthophotos, and aggregates DINOv3 features with a Fisher vector model trained only on dron...

  59. UniDAC: Universal Metric Depth Estimation for Any Camera

    cs.CV 2026-03 unverdicted novelty 7.0

    UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.

  60. Visual-ERM: Reward Modeling for Visual Equivalence

    cs.CV 2026-03 unverdicted novelty 7.0

    Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.