DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
super hub Mixed citations
DINOv2: Learning Robust Visual Features without Supervision
Mixed citation behavior. Most common role is background (44%).
abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques
authors
co-cited works
representative citing papers
X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.
EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.
GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.
WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.
AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.
A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.
First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.
BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.
SHOW is a mask-promptable framework coupling feed-forward scene reconstruction with human mesh recovery in a unified metric space to resolve scale ambiguity and improve human-scene alignment from monocular video.
MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.
Introduces TSMa using text-visual channel interaction and SHARe using ViT layer-aligned autoregressive regression to improve prototype-based few-shot object detection, reporting +10.1 nAP on COCO.
citing papers explorer
-
LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation
LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.
-
Conformal Margin Risk Minimization: An Envelope Framework for Robust Learning under Label Noise
CMRM adds a conformal quantile regularization on prediction margins to any loss, improving noisy-label classification accuracy up to 3.39% across methods and benchmarks while preserving performance at zero noise.
-
Continual Visual Anomaly Detection on the Edge: Benchmark and Efficient Solutions
First benchmark for continual visual anomaly detection on edge devices plus Tiny-Dinomaly, a lightweight DINO-based model with 13x smaller memory, 20x lower compute, and 5-point Pixel F1 gain.
-
Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.
-
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.
-
From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal
Visual encoders leak identity information; a one-shot linear subspace removal method (ISP) reduces leakage to near-chance levels while retaining high non-biometric utility across datasets.
-
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
-
The Indra Representation Hypothesis for Multimodal Alignment
Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal robustness when instantiated with angular distance.
-
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI and Gen3DSR while keeping diffusion efficiency.
-
GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.
-
Training a Student Expert via Semi-Supervised Foundation Model Distillation
A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
-
InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset
InCaRPose is a Transformer-based model trained on synthetic data that predicts absolute metric-scale relative poses between distorted in-cabin camera views and generalizes to real images while releasing a new test dataset.
-
Beauty in the Eye of AI: Aligning LLMs and Vision Models with Human Aesthetics in Network Visualization
LLMs and vision models achieve human-human alignment levels in judging network visualization aesthetics through prompt engineering on a new dataset of human preferences from 27 participants.
-
VOSR: A Vision-Only Generative Model for Image Super-Resolution
VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.
-
SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation
SD-FSMIS adapts Stable Diffusion for few-shot medical image segmentation via support-query interaction and visual-to-textual translation, yielding competitive performance and strong cross-domain generalization.
-
HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits
HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
-
DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning
DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized methods and 23.4% over GPT-4o on T-IC13.
-
Satellite-Free Training for Drone-View Geo-Localization
A satellite-free training framework reconstructs 3D drone scenes via Gaussian splatting, generates geometry-normalized pseudo-orthophotos, and aggregates DINOv3 features with a Fisher vector model trained only on drone data to enable cross-view retrieval.
-
TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding
TreeGaussian introduces a tree-guided cascaded contrastive framework that models hierarchical semantic relationships in 3D Gaussian scenes to improve consistent segmentation and understanding.
-
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.
-
EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction
EndoVGGT uses a dynamic DeGAT graph attention module to improve depth estimation and non-rigid 3D reconstruction in surgery, reporting 24.6% PSNR and 9.1% SSIM gains on SCARED with zero-shot generalization to new domains.
-
Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories
PF-MA is a new active learning rule that favors likely-positive uncertain samples to speed up discovery of rare categories in imbalanced visual retrieval.
-
How Out-of-Equilibrium Phase Transitions can Seed Pattern Formation in Trained Diffusion Models
Pattern formation in trained diffusion models emerges from out-of-equilibrium phase transitions driven by instabilities in low-frequency denoising modes linked to data symmetries and architectural constraints.
-
SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
-
Emergent Compositional Communication for Latent World Properties
Multi-agent iterated learning produces emergent positionally disentangled communication protocols for latent physical properties from unsupervised video features.
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
HeiSD delivers up to 2.45x faster inference for embodied VLA models by hybridizing speculative decoding with kinematic boundary detection and error-mitigation tricks while preserving task success rates.
-
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.
-
AWPD: Frequency Shield Network for Agnostic Watermark Presence Detection
FSNet detects unknown invisible watermarks via adaptive frequency gating and multi-spectral attention on the UniFreq-100K dataset, claiming superior zero-shot performance.
-
SCP: Spatial Causal Prediction in Video
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
-
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
-
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.
-
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
A training-free method fits PCA to DINOv2 features from few normal images and detects anomalies via reconstruction residual, reaching SOTA one-shot AUROC of 97.1% image-level on MVTec-AD and 93.2% on VisA.
-
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.
-
Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning
Split-MoPE integrates split learning with predefined-expert routing to maximize usable data in vertical federated learning under sample misalignment, delivering state-of-the-art accuracy in one communication round plus built-in robustness and per-sample contribution scores.
-
RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
RAD retrieves semantically similar RGB-D context samples for low-confidence regions and fuses them via matched cross-attention to cut relative absolute depth error by 29.2% on NYU Depth v2 underrepresented classes while staying competitive on standard benchmarks.
-
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
-
FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction
FreeOrbit4D recovers a foreground-complete 4D proxy via decoupled background and object-centric reconstruction to provide geometric guidance for large-angle camera redirection in monocular videos using conditional video diffusion.
-
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction
ReWeaver reconstructs topology-accurate 3D garments and sewing patterns from sparse multi-view images by predicting seams and panels in 2D UV and 3D space using a new 100k-sample synthetic dataset.
-
UIKA: Fast Universal Head Avatar from Pose-Free Images
UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.
-
It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.
-
MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
-
ART: Articulated Reconstruction Transformer
ART is a category-agnostic transformer that maps sparse multi-state RGB images to per-part 3D geometry, texture, and articulation parameters via learnable part slots.
-
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
MoCapAnything reconstructs asset-specific BVH animations from monocular video by predicting 3D joint trajectories then applying constraint-aware inverse kinematics guided by a reference prompt encoder.
-
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
OpenTrack3D achieves state-of-the-art open-vocabulary 3D instance segmentation by generating cross-view consistent proposals online with a visual-spatial tracker and replacing CLIP with an MLLM for improved compositional reasoning.
-
From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity
Flow matching models follow a two-stage process of navigation across data modes then refinement to nearest samples, revealed by exact computation of the oracle marginal velocity field.
-
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer
FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.
-
$\pi^3$: Permutation-Equivariant Visual Geometry Learning
π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and dense reconstruction benchmarks.
-
SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples
SCOOTER supplies best-practice guidelines, open tools, and a 3K-image benchmark with 34K+ human ratings showing that six tested unrestricted attacks produce images humans can detect as fake.
-
GenHSI: Controllable Generation of Human-Scene Interaction Videos
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.