DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
DINOv2: Learning Robust Visual Features without Supervision
Mixed citation behavior. Most common role is background (44%).
abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
representative citing papers
X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Introduces the VICIS task and a training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.
InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.
The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.
A systematic survey unifies presentation, digital injection, and GenAI synthesis attacks on identity documents, audits datasets for a reality gap, identifies SDGI in multimodal models, and reports APCER above 25% for top models on synthetic IDs.
A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.
EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.
GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.
WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.
AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.
A new dataset of 220k+ cross-view pairs accompanies GAGeo, a single-stage geometry-aware model based on the π³ 3D foundation model that outperforms prior methods on object geo-localization, with strong generalization and zero-shot ground-to-drone capability.
First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.
citing papers explorer
-
MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation
MIMFlow uses a VAE on masked images to feed semantic latents to a normalizing flow while a decoder handles high-frequency details, reporting FID 2.50 and 71.3% linear probing on ImageNet 256x256 with 128 tokens.
-
Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity
Presents the Invoice Haystack benchmark for homogeneous document retrieval and a VL-RAG hybrid framework achieving 60% Recall@1 and gains of up to 13.5 points over prior methods.
-
RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer
RegimeVGGT applies layer-wise U-shaped compression via saliency-guided banded merging and selectively protected K/V downsampling to deliver 6.7x speedup on VGGT at matched reconstruction quality.
-
SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues
SierpinskiCam adds Sierpinski dome texture cues and negative-RoPE reference video conditioning to geometry-guided video diffusion to improve camera controllability and consistency in video retaking.
-
Contrastive Action-Image Pre-training for Visuomotor Control
CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.
-
Modality Forcing for Scalable Spatial Generation
Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.
-
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.
-
ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection
ViPER uses a LoRA-adapted ViT-B/14 with dual heads for malware classification and packing detection plus a gating mechanism and weighted losses to reach 0.8521 balanced accuracy on 200k Windows PE images while detecting packing at 0.9949 AUC.
-
Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning
DSSA decouples per-frame appearance from temporal identity in slot attention mechanisms to reduce slot swapping and improve temporal consistency in video object segmentation.
-
Action-Effect Memory Pretraining for Robot Manipulation
AEM pretrains compact history representations via masked modeling on interleaved vision-action sequences to boost downstream robot manipulation in simulation and real settings.
-
Tac-DINO: Learning Vision-Tactile Features with Patch Alignment
Tac-DINO constructs a large tactile dataset and Vis-Tac Holographic Matching Benchmark, then proposes Vision-Tactile Patch Alignment (VTPA) methods that outperform non-aligned baselines on local-to-global feature matching.
-
LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation
LASA aggregates multi-layer attention from vision transformers to enable weakly supervised open-vocabulary semantic segmentation on scene sketches, reporting mIoU gains of +3.43 to +15.74 on three benchmarks over prior baselines.
-
Cross-Modal Benchmarking for Robotic Perception in Natural Environments
Presents the WildCross benchmark with 476K frames for place recognition and metric depth estimation in natural environments, demonstrating limitations of existing vision models.
-
TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation
TacForeSight trains a force-conditioned tactile world model to predict latent dynamics and uses those predictions as anticipatory priors inside a visuo-tactile policy for real-time contact-rich manipulation.
-
Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving
Envision4D presents a feed-forward 4D Gaussian Splatting framework with future pose prediction, temporal attention, and conditioned motion lifting for pose-free extrapolation in autonomous driving scenes.
-
Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning
BFQ enables single-step noise-to-action mapping in offline RL by dividing flow-path displacements into bootstrappable short-range components learned from marginal velocity.
-
Popcorn: A Configurable Benchmark for Visual Evidence in Multimodal Movie Recommendation
Popcorn is a new benchmark standardizing modality assembly, fusion, and evaluation of thumbnails, trailers, and full movies encoded by VLMs for multimodal movie recommendation.
-
See More, Match Better: Multi-Source Feature Fusion for Two-View Correspondence Learning
TriMatch fuses geometric, texture semantic, and structural semantic features via dedicated alignment and modulation modules to improve inlier-outlier discrimination in two-view correspondence learning.
-
G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation
G2G attaches three small trainable modules to frozen backbones and reports state-of-the-art inter-group pose accuracy on four datasets spanning simulation, real cross-season, and sim-to-real transfer using only relative-pose supervision.
-
DALE-CT: Depth-Aware Foundation Models for Computed Tomography
DALE-CT, a 2D LeJEPA model with depth-aware dual supervision, reaches 0.833 Macro AUROC on multi-abnormality detection in CT and approaches 3D SOTA performance using less data and no textual supervision.
-
LARA: Latent Action Representation Alignment for Vision-Language-Action Models
LARA jointly optimizes LAM and VLA models via representation alignment to improve robotic manipulation performance using human videos.
-
ForensicConcept: Transferable Forensic Concepts for AIGI Detection
ForensicConcept extracts and transfers forensic concepts from AIGI detectors via Transformer attribution, concept codebooks, CleanDIFT references, and CKNNA alignment to improve detection on unseen generators.
-
DaX: Learning General Pathology Representations Across Scales
DaX is a pathology vision foundation model that extends DINOv3 with continuous magnification training and cross-scale consistency, achieving top average performance on a benchmark of 161 tasks from 44 datasets covering 28k patients.
-
Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy
DirectAnimator bypasses pose extraction using a Driving Cue Triplet and Same2X training strategy to achieve state-of-the-art human animation quality and robustness from raw videos.
-
Meridian: Metric-Semantic Primitive Matching for Cross-View Geo-Localization Beyond Urban Environments
Meridian matches metric-semantic primitives across aerial and ground views for training-free global localization in diverse natural environments, reporting 2.4 m average trajectory error over 19 km.
-
Geometry-Aware Dataset Condensation for Diffusion Model Training
A geometry-aware dataset condensation technique reformulates subset selection as one-sided partial optimal transport alignment plus regularization to improve diffusion model training fidelity.
-
X4Val: Learning Neural Surrogates for Variance-Reduced Policy Evaluation
X4Val learns transferable neural predictors from non-paired multi-domain data and incorporates them into control-variates estimators to reduce variance in real-world robotic policy evaluation by up to 38.4%.
-
TransTac: Visuo-Tactile Modality Transition via Ultraviolet-Encoded Transparent Elastomers
TransTac is a transparent UV-encoded binocular vision-based tactile sensor that integrates visual and marker-based tactile reconstruction, achieving 83.3% zero-shot recognition accuracy and stronger cross-modal alignment than opaque baselines.
-
Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models
GPUA learns an orthogonal mapping from VFM to VLM feature space to preserve geometry and improve cross-model compatibility for zero-shot recognition and segmentation.
-
KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models
KODA uses modality-wise kernel composition and constrained optimization to discover interpretable discrepancy structures between vision-language representations.
-
Beyond Compression: Quantifying Spectral Accessibility in Vision Representations
Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.
-
PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization
PRISM is a two-stage MoE framework that achieves new state-of-the-art results on PASCAL-Context and NYUD-v2 by enabling self-organized expert specialization across diverse vision foundation models.
-
GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations
GLINT introduces sparsely gated alignment and dense feature regularization on top of DINOv3 and V-JEPA encoders to enable query-specific zero-shot grounding and segmentation in 2D CXR and 3D CT.
-
DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation
DOME learns sample-specific domain variables from sparse supervision via vision-language models and a sparse domain bank to improve test-time adaptation performance.
-
BEAST3D: Animal behavioral analysis and neural encoding from multi-view video via Gaussian splatting
BEAST3D learns viewpoint-invariant 3D features from calibrated multi-view animal videos via Gaussian splatting for novel view synthesis, pose estimation, and neural encoding across four species.
-
MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents
MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.
-
FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds
FlatVPR adds a learnable residual adapter and a curvature-minimizing loss to foundation-model features so that descriptors between distant anchors can be reconstructed by linear interpolation, improving VPR on NCLT at 100 m spacing.
-
DeblurNVS: Geometric Latent Diffusion for Novel View Synthesis from Sparse Motion-Blurred Images
DeblurNVS restores geometric representations via latent diffusion to enable high-fidelity novel view synthesis directly from sparse motion-blurred inputs.
-
Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs
Dynamic scene graphs serve as explicit memory to improve imitation learning policies for spatial-temporal reasoning under partial observability in mobile and tabletop manipulation.
-
CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery
CAFOSat is a new strongly annotated remote-sensing dataset for CAFO mapping that uses human-in-the-loop refinement and curated negatives, with benchmarks on CNNs, transformers, and vision-language models plus a synthetic augmentation pipeline.
-
Spatial Transcriptomics-Guided Alignment Enhances Molecular Profiling in Pathology Foundation Model
STAMP uses a curated 1.8M-pair spatial transcriptomics atlas and pathway-informed alignment to augment pathology foundation models for molecular phenotype inference from H&E WSIs.
-
HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model
HARP aligns human-robot visual and latent action representations via paired bridges and unpaired dynamics supervision to boost VLA policy performance on manipulation tasks.
-
VLM3: Vision Language Models Are Native 3D Learners
Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
-
Fairness Beyond Demographics: Optimizing Performance Across Appearance-Based Hidden Cohorts in Medical Imaging
LHCF trains medical image models for fairness by optimizing across latent appearance-based cohorts discovered via clustering, achieving SOTA results on single and multiple demographic attributes without using any demographic labels.
-
Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification
Introduces the VIP identification task, releases the Temporal-VIP dataset, and presents the VIP-Net framework, which achieves 67.3% accuracy at identifying important persons in videos and a rationale similarity of 0.63.
-
Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
DeGO decouples rigid and nonrigid motion in Gaussian occupancy prediction via factorized 4D distillation from VGGT, reporting SOTA results on Occ3D-NuScenes with 13.5% gains on human-centric cases.
-
Turning Video Models into Generalist Robot Policies
Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.
-
Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data
Trinity is a unified transformer that performs both class-specific semantic segmentation and class-agnostic terrain segmentation, trained on synthetic RUGDSynth data and evaluated on the new EXTerra real-world dataset.
-
Representation-Conditioned Diffusion Models for Guided Training Data Generation
Representation-conditioned diffusion models generate synthetic ImageNet data that trains classifiers to higher top-1 accuracy than class-conditioned generation (+10.76 pp) or real data (+2.0 pp when scaled).
-
Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery
A Perceiver IO fusion architecture combines satellite and street-level imagery via DINOv2 tokens and RGB-M masking to classify roof attributes on a new dataset of 32,135 buildings across ten countries.