DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
DINOv2: Learning Robust Visual Features without Supervision
Mixed citation behavior. Most common role is background (44%).
abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
representative citing papers
X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.
Image-to-3D models generate harmful geometries in most cases, with under 0.3% caught by commercial filters; existing safeguards are weak, but a stacked defense cuts harmful outputs to under 1% at an 11% false-positive cost.
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.
EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.
GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.
WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.
AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.
A new dataset of 220k+ cross-view pairs accompanies GAGeo, a single-stage geometry-aware model based on the π³ 3D foundation model that outperforms prior methods on object geo-localization, with strong generalization and zero-shot ground-to-drone capability.
First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves that H ≤ G implies G-models embed into H-models and that single-head equivariant attention realizes all ordinary G-equivariant maps, introduces a D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.
BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.
SHOW is a mask-promptable framework coupling feed-forward scene reconstruction with human mesh recovery in a unified metric space to resolve scale ambiguity and improve human-scene alignment from monocular video.
MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.
Introduces TSMa using text-visual channel interaction and SHARe using ViT layer-aligned autoregressive regression to improve prototype-based few-shot object detection, reporting +10.1 nAP on COCO.
citing papers explorer
-
GenHSI: Controllable Generation of Human-Scene Interaction Videos
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
-
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
-
A European Multi-Center Breast Cancer MRI Dataset
Releases a new public multi-center European breast MRI dataset of 741 cases with heterogeneous protocols and provides baseline transformer model benchmarks.
-
FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry
FractalMamba++ scales Vision Mamba across resolutions by using Hilbert fractal serialization, hierarchy-based skip connections, and fractal-aware 2D rotary position encoding.
-
VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
VGGT-SLAM aligns VGGT submaps via SL(4) manifold optimization of 15-DoF homographies to enable consistent dense RGB SLAM on long uncalibrated monocular videos.
-
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
-
Toward Generalizable Forgery Detection and Reasoning
FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.
-
Adaptive Camera Sensor for Vision Models
Lens adapts camera sensors in real time via the VisiT confidence-based quality indicator to improve vision model accuracy on domain-shifted images, shown on ImageNet-ES and a new diverse benchmark.
-
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
Project Aria: A New Tool for Egocentric Multi-Modal AI Research
Project Aria presents a new wearable egocentric multi-modal recording device and software tools to accelerate AI research for augmented reality applications.
-
RoMa: Robust Dense Feature Matching
RoMa sets new state-of-the-art dense feature matching performance by fusing DINOv2 features with local ConvNet features, using anchor-probability transformer decoding, and regression-by-classification loss, with a 36% gain on WxBS.
-
Personalized Object Identification and Localization via In-Context Inference with Vision-Language Models
IPLoc-ID extends prior localization-only work to full identification and localization by using a self-posed query in VLMs to reject negative images while preserving comparable localization accuracy.
-
Does Your ViT Still Need U-Net for Segmentation?
EoSeg shows that modern ViT backbones support accurate medical image segmentation without U-Net-style decoders via multi-level query modeling and learnable block fusion, with strong results on seven benchmarks.
-
Decompose, Compare, and Decide: Multimodal LLMs are Implicit Few-Shot Learners
DeCoDe decomposes few-shot classification into binary pairwise image comparisons whose affirmative logits serve as similarity scores, enabling strong performance from unmodified MLLMs on twelve datasets.
-
Lost in the Tail: Addressing Geographic Imbalance in Urban Visual Place Recognition
DAPR is a model-agnostic plug-in that rebalances gradient contributions across head and tail classes and applies multi-scale distance search for distributional compactness, improving VPR performance by 18.3% on SF-XL v1 and 6.7% on v2.
-
Towards Voxel Spacing Consistency for Medical Image Segmentation
Consispace is a semantic-aware resampling method that uses an implicit neural network with ODE constraints and feature reweighting to achieve consistent axial voxel spacing while preserving anatomy and semantics, improving downstream segmentation.
-
PrISM-IQA: Image Quality Assessment Made Practical for Smartphone Photography
PrISM-IQA reformulates IQA as multi-issue ordinal diagnosis predicting absent/minor/severe/critical levels for 53 ISP issues using cumulative encoding and structured inference.
-
DPPE: Rethinking Camera-Based Positional Encoding for Scaling Multi-View Transformers
DPPE decouples rotation and translation in camera positional encodings for multi-view transformers to resolve late-stage training stagnation and improve generalization in novel view synthesis.
-
DualBrep: A Dual-Field Continuous Representation for B-rep Modelling
DualBrep encodes B-rep models as dual scalar fields (SDF geometry + UDF topology) compressed into a shared latent space for flow-matching generation and neural B-rep extraction.
-
Delta-JEPA: Learning Action-Sensitive World Models via Latent Difference Decoding
Delta-JEPA augments latent forward prediction with a Latent Difference Action Decoder that reconstructs actions from embedding displacements, yielding action-sensitive world models that improve planning on four visual continuous-control tasks over JEPA baselines.
-
WildProp: Visual Estimation of Wildlife Body Proportions at Scale
A retrieval-based framework using foundation models for pose-aware correspondence to estimate population-level wildlife body proportions from unconstrained images, with reported 10-20% median relative errors on bird and amphibian datasets.
-
GROW²: Grounding Which and Where for Robot Tool Use
GROW² hierarchically grounds open-world tool affordances by using VLMs for semantic selection of objects and parts followed by geometric localization with vision foundation models.
-
Sequential Planning via Anchored Robotic Keypoints
SPARK reaches 43.7% success on six LIBERO-PRO cells using LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling the success of the CaP-Agent0 and VLA baselines.
-
Benchmark AUC Is Not Deployable Reliability: A Cross-Dataset Audit of Off-the-Shelf Features for Surveillance Video Anomaly Detection
Cross-dataset testing of nearest-neighbor and Mahalanobis anomaly detectors on CLIP, DINOv2, ResNet-50 and EfficientNet embeddings shows same-dataset AUC averaging 0.704 dropping to 0.499 on other datasets, with false-alarm rates around 31,931 per hour at usable operating points.
-
Rectifying Mask via Entropy for Distractor-Free 3DGS in Ambiguous Scenarios
RefineSplat applies entropy-aware adaptive masking and density control to 3DGS to remove color- or semantically ambiguous distractors, validated on a new 18-scene Ambiguous wild dataset with claimed SOTA results.
-
Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking
A language dependency parsing mechanism combined with Qwen-VL enables adaptive updates to textual descriptions for improved vision-language tracking performance on benchmarks like TNL2K and LaSOT.
-
Multi-scale Object-Aware Gaze Estimation via Geometric Reasoning
A two-stage object-aware gaze estimation method with multi-scale feature fusion and geometric constraints reports AUC scores of 0.961, 0.948, 0.987, and 0.977 on GazeFollow, VideoAttentionTarget, ChildPlay, and GOO-Real with a 7.1M parameter model.
-
MoPe: Motion Permanence for Robust Monocular Gaussian Mapping in Dynamic Environments
MoPe propagates historical dynamic posteriors via SE(3) warping and bounded Bayesian fusion to maintain persistent motion state in monocular Gaussian SLAM.
-
Flow Matching in Feature Space for Stochastic World Modeling
FlowWM applies flow matching directly in pretrained feature space with a one-step projection mechanism, improving perception accuracy, mode coverage, and horizon robustness on synthetic and real-world benchmarks.
-
Envisage: Diffusion-Based Rhinoplasty Goal Visualization with Mask-Decomposed Evaluation
Envisage applies FLUX.1 inpainting to rhinoplasty goal visualization and shows via SurgicalScore that mask-decomposed metrics outperform full-face identity scores for hard-composited localized edits.
-
Learning Topology-Aware Representations via Test-Time Adaptation for Anomaly Segmentation
TopoTTA integrates persistent homology into test-time adaptation to derive topological pseudo-labels from anomaly maps, improving segmentation by an average 15% F1 on six benchmarks while generalizing across 2D and 3D data.
-
VLM-Aware Meta-Optic Front-End Design for Frozen Vision-Language Models
CODA optimizes continuous-density meta-optics via adjoint gradients on Maxwell simulations to boost frozen CLIP zero-shot accuracy on ImageNet-100 from 53.75% to 65.41%, with transfer to other models.
-
Beyond Points: Spherical Distributional Part Prototypes for Interpretable Classification
vMFProto models each class as a mixture of von Mises-Fisher components on the hypersphere, learns per-prototype concentrations, and applies entropic OT for assignments, yielding SOTA explanation quality on CUB, Dogs, and Cars with frozen DINO backbones.
-
ViQ: Text-Aligned Visual Quantized Representations at Any Resolution
ViQ is a new two-stage text-aligned quantization method for visual features supporting arbitrary resolutions that claims competitive multimodal performance with efficiency gains of 20-70%.
-
SatSplatDiff: Geometry-preserving generative refinement for high-fidelity satellite Gaussian Splatting
SatSplatDiff combines depth supervision and shadow-guided generative refinement with 2DGS to reduce geometric MAE by up to 18% and improve visual fidelity by 28-45% on satellite datasets while enabling 5x resolution enhancement.
-
Forget, Anticipate and Adapt: Test Time Training for Long Videos
FFN performs test-time training (TTT) on multi-hour videos by restricting updates to three frames and using a surprise metric for adaptive window sizing; the work also introduces a new EpicTours dataset.
-
MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation
MIMFlow uses a VAE on masked images to feed semantic latents to a normalizing flow while a decoder handles high-frequency details, reporting FID 2.50 and 71.3% linear probing on ImageNet 256x256 with 128 tokens.
-
Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity
Presents the Invoice Haystack benchmark for homogeneous document retrieval and a hybrid VL-RAG framework that achieves 60% Recall@1 and gains of up to 13.5 points over prior methods.
-
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.
-
ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection
ViPER uses a LoRA-adapted ViT-B/14 with dual heads for malware classification and packing detection plus a gating mechanism and weighted losses to reach 0.8521 balanced accuracy on 200k Windows PE images while detecting packing at 0.9949 AUC.
-
Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning
DSSA decouples per-frame appearance from temporal identity in slot attention mechanisms to reduce slot swapping and improve temporal consistency in video object segmentation.
-
Action-Effect Memory Pretraining for Robot Manipulation
AEM pretrains compact history representations via masked modeling on interleaved vision-action sequences to boost downstream robot manipulation in simulation and real settings.
-
Tac-DINO: Learning Vision-Tactile Features with Patch Alignment
Tac-DINO constructs a large tactile dataset and Vis-Tac Holographic Matching Benchmark, then proposes Vision-Tactile Patch Alignment (VTPA) methods that outperform non-aligned baselines on local-to-global feature matching.
-
LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation
LASA aggregates multi-layer attention from vision transformers to enable weakly supervised open-vocabulary semantic segmentation on scene sketches, reporting mIoU gains of +3.43 to +15.74 on three benchmarks over prior baselines.
-
Cross-Modal Benchmarking for Robotic Perception in Natural Environments
Presents the WildCross benchmark with 476K frames for place recognition and metric depth estimation in natural environments, demonstrating limitations of existing vision models.
-
TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation
TacForeSight trains a force-conditioned tactile world model to predict latent dynamics and uses those predictions as anticipatory priors inside a visuo-tactile policy for real-time contact-rich manipulation.
-
Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving
Envision4D presents a feed-forward 4D Gaussian Splatting framework with future pose prediction, temporal attention, and conditioned motion lifting for pose-free extrapolation in autonomous driving scenes.
-
Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning
BFQ enables single-step noise-to-action mapping in offline RL by dividing flow-path displacements into bootstrappable short-range components learned from marginal velocity.