super hub Mixed citations

DINOv2: Learning Robust Visual Features without Supervision

Huy Vo, Marc Szafraniec, Maxime Oquab, Vasil Khalidov · 2023 · cs.CV · arXiv 2304.07193

Mixed citation behavior. Most common role is background (44%).

803 Pith papers citing it

Background 44% of classified citations

open full Pith review browse 803 citing papers more from Huy Vo arXiv PDF

abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 59 background 57 baseline 9 dataset 3 other 1

citation-polarity summary

background 57 use method 57 baseline 9 unclear 4 use dataset 2

claims ledger

abstract The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques

authors

Huy Vo Marc Szafraniec Maxime Oquab Th\'eo Moutakanni Timoth\'ee Darcet Vasil Khalidov

co-cited works

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

eess.IV · 2026-06-07 · unverdicted · novelty 8.0

X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · novelty 8.0

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Show Me Examples: Inferring Visual Concepts from Image Sets

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.

InvSplat: Inverse Feed-Forward Scene Splatting

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.

Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.

From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attack and Detection

cs.CR · 2026-07-01 · unverdicted · novelty 7.0

A systematic survey unifies presentation, digital injection, and GenAI synthesis attacks on identity documents, audits datasets for a reality gap, identifies SDGI in multimodal models, and reports APCER above 25% for top models on synthetic IDs.

Prototype Memory-Guided Training-Free Anomaly Classification and Localization in Prenatal Ultrasound

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.

EPO: Boosting 3D Foundation Models with Edge-based Pose Optimization

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

AnyMatch: Supercharging Universal Multi-Modal Image Matching with Large-Scale Single-View Images

cs.CV · 2026-06-30 · unverdicted · novelty 7.0 · 2 refs

AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.

Complete virtual unwrapping and reading of a rolled Herculaneum papyrus

eess.IV · 2026-06-27 · unverdicted · novelty 7.0

First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$

cs.CV · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.

citing papers explorer

Showing 50 of 54 citing papers after filters.

DataComp-VLM: Improved Open Datasets for Vision-Language Models cs.CV · 2026-06-26 · conditional · none · ref 231 · 2 links · internal anchor
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects cs.CV · 2026-05-27 · conditional · none · ref 36 · internal anchor
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
CalibAnyView: Beyond Single-View Camera Calibration in the Wild cs.CV · 2026-05-14 · conditional · none · ref 35 · internal anchor
A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models cs.CR · 2026-05-10 · conditional · none · ref 26 · internal anchor
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors cs.RO · 2026-06-26 · conditional · none · ref 51 · internal anchor
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models cs.CV · 2026-05-19 · conditional · none · ref 4 · internal anchor
Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation cs.RO · 2026-05-12 · conditional · none · ref 21 · internal anchor
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation cs.CV · 2026-05-09 · conditional · none · ref 9 · internal anchor
Raw CSD cosine similarity produces negative discrimination gaps for many artists and does not support absolute style-fidelity interpretation, but CSLS readout on frozen backbones reduces failures and improves AUC.
InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery cs.CV · 2026-05-06 · conditional · none · ref 7 · 2 links · internal anchor
InterMesh explicitly incorporates human-object interaction semantics into multi-person mesh recovery via a detector and two lightweight modules, delivering up to 9.9% MPJPE reduction on interaction-heavy datasets.
VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization cs.CV · 2026-04-14 · conditional · none · ref 1 · internal anchor
VideoFlexTok introduces coarse-to-fine variable-length video tokenization that enables 5x smaller models to match grid-based generation quality and supports 10-second videos with 8x fewer tokens.
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection cs.CV · 2026-04-13 · conditional · none · ref 54 · internal anchor
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal cs.CV · 2026-04-07 · conditional · none · ref 3 · internal anchor
Visual encoders leak identity information; a one-shot linear subspace removal method (ISP) reduces leakage to near-chance levels while retaining high non-biometric utility across datasets.
Training a Student Expert via Semi-Supervised Foundation Model Distillation cs.CV · 2026-04-04 · conditional · none · ref 31 · internal anchor
A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
Beauty in the Eye of AI: Aligning LLMs and Vision Models with Human Aesthetics in Network Visualization cs.LG · 2026-04-03 · conditional · none · ref 3 · internal anchor
LLMs and vision models achieve human-human alignment levels in judging network visualization aesthetics through prompt engineering on a new dataset of human preferences from 27 participants.
VOSR: A Vision-Only Generative Model for Image Super-Resolution cs.CV · 2026-04-03 · conditional · none · ref 25 · internal anchor
VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.
Satellite-Free Training for Drone-View Geo-Localization cs.CV · 2026-04-02 · conditional · none · ref 27 · internal anchor
A satellite-free training framework reconstructs 3D drone scenes via Gaussian splatting, generates geometry-normalized pseudo-orthophotos, and aggregates DINOv3 features with a Fisher vector model trained only on drone data to enable cross-view retrieval.
Emergent Compositional Communication for Latent World Properties cs.MA · 2026-03-18 · conditional · none · ref 24 · internal anchor
Multi-agent iterated learning produces emergent positionally disentangled communication protocols for latent physical properties from unsupervised video features.
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control cs.RO · 2026-03-18 · conditional · none · ref 20 · internal anchor
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
UIKA: Fast Universal Head Avatar from Pose-Free Images cs.CV · 2026-01-12 · conditional · none · ref 52 · internal anchor
UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.
MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding cs.CV · 2025-12-19 · conditional · none · ref 60 · internal anchor
MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity cs.LG · 2025-12-02 · conditional · none · ref 33 · internal anchor
Flow matching models follow a two-stage process of navigation across data modes then refinement to nearest samples, revealed by exact computation of the oracle marginal velocity field.
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer cs.CV · 2025-09-02 · conditional · none · ref 14 · internal anchor
FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.
$\pi^3$: Permutation-Equivariant Visual Geometry Learning cs.CV · 2025-07-17 · conditional · none · ref 5 · internal anchor
π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and dense reconstruction benchmarks.
SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples cs.CV · 2025-07-10 · conditional · none · ref 52 · internal anchor
SCOOTER supplies best-practice guidelines, open tools, and a 3K-image benchmark with 34K+ human ratings showing that six tested unrestricted attacks produce images humans can detect as fake.
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation cs.RO · 2024-09-03 · conditional · none · ref 5 · internal anchor
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone cs.LG · 2026-05-12 · conditional · none · ref 28 · 2 links · internal anchor
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing cs.CV · 2026-05-08 · conditional · none · ref 26 · internal anchor
Semantic pseudo-pairing via DINOv2 embeddings and fused Gromov-Wasserstein optimal transport enables training a 7K-parameter CNN for unpaired smartphone ISP, achieving 22.569 PSNR on the NTIRE 2026 challenge test set.
Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping cs.CV · 2026-05-07 · conditional · none · ref 46 · internal anchor
Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.
Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising physics.geo-ph · 2026-04-30 · conditional · none · ref 58 · internal anchor
Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.
Euclid Quick Data Release (Q1). AstroVink: A vision transformer approach to find strong gravitational lens systems astro-ph.IM · 2026-04-23 · conditional · none · ref 51 · internal anchor
A vision transformer classifier trained on simulated and real Euclid data recovers all known strong lenses in test sets and finds 8 Grade A plus 26 Grade B new candidates in the Q1 data.
SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation cs.CV · 2026-04-15 · conditional · none · ref 48 · internal anchor
SceneGlue introduces a scene-aware feature matching framework using implicit parallel attention and an explicit Visibility Transformer, trained solely on local feature matches to enhance cross-view correspondence without scene annotations.
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training cs.LG · 2026-04-08 · conditional · none · ref 24 · internal anchor
Data Warmup accelerates diffusion training on ImageNet by scheduling images from low to high complexity via a foreground-based metric and temperature-controlled sampler, improving FID and IS scores faster than uniform sampling.
Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks cs.CV · 2026-04-05 · conditional · none · ref 20 · internal anchor
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.
Automated Segmentation and Tracking of Group Housed Pigs Using Foundation Models cs.CV · 2026-04-03 · conditional · none · ref 2 · internal anchor
Foundation models plus modular tracking achieve stable long-term pig segmentation and identity maintenance on farm videos with 0.99 MOTA and no switches.
Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models cs.CV · 2026-02-02 · conditional · none · ref 20 · internal anchor
Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.
Mirai: Autoregressive Visual Generation Needs Foresight cs.CV · 2026-01-21 · conditional · none · ref 28 · internal anchor
Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation cs.CV · 2025-11-24 · conditional · none · ref 41 · internal anchor
DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.
Unify Robot Actions in Camera Frame cs.RO · 2025-11-21 · conditional · none · ref 29 · internal anchor
CalibAll estimates camera extrinsics on existing datasets to convert robot actions into a unified camera-frame representation, enabling stronger cross-embodiment pretraining.
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics cs.LG · 2025-11-11 · conditional · none · ref 51 · internal anchor
LeJEPA derives an optimal isotropic Gaussian target for embeddings and enforces it via sketched regularization to deliver scalable, heuristics-free self-supervised pretraining with 79% ImageNet linear accuracy on ViT-H/14.
VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance cs.CV · 2025-10-08 · conditional · none · ref 23 · internal anchor
VA-Adapter adapts ultrasound foundation models for echocardiography probe guidance by embedding a vision-action module that infers individual 3D cardiac anatomy from historical sequences, outperforming prior methods with roughly 33 times fewer trainable parameters on a 1.31 million sample dataset.
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation cs.RO · 2025-08-26 · conditional · none · ref 17 · internal anchor
MemoryVLA introduces a perceptual-cognitive memory bank and working-memory retrieval mechanism into VLA models, raising success rates on long-horizon robotic tasks by up to 26 points over prior baselines.
Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters? cs.CV · 2025-07-14 · conditional · none · ref 33 · internal anchor
The ITW-SM dataset and targeted optimization of detector design choices yield a 26.87% average AUC improvement for state-of-the-art AI-generated image detectors under real-world social media conditions.
AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting cs.CV · 2025-06-01 · conditional · none · ref 49 · internal anchor
AuralSAM2 fuses audio-visual features via a pyramid-based AuralFuser module and audio-guided contrastive loss to improve promptable segmentation accuracy in SAM2 with minimal efficiency impact.
ImgEdit: A Unified Image Editing Dataset and Benchmark cs.CV · 2025-05-26 · conditional · none · ref 52 · internal anchor
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning cs.RO · 2025-05-24 · conditional · none · ref 55 · internal anchor
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
Policy Contrastive Decoding for Robotic Foundation Models cs.RO · 2025-05-19 · conditional · none · ref 5 · internal anchor
PCD redirects robotic policies toward object-relevant visual features via contrastive decoding on masked inputs, improving generalization without retraining or weight access.
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps cs.CV · 2025-01-16 · conditional · none · ref 53 · internal anchor
Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies cs.RO · 2024-12-13 · conditional · none · ref 90 · internal anchor
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
Revisiting Feature Prediction for Learning Visual Representations from Video cs.CV · 2024-02-15 · conditional · none · ref 263 · internal anchor
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra cs.LG · 2026-05-23 · conditional · none · ref 27 · internal anchor
Muon optimizer outperforms AdamW in ViT training on two image datasets, with gains that depend on data augmentation strength and are linked to wider singular-value spread in QKV gradients and prevention of late-training mode collapse in MLP blocks.

DINOv2: Learning Robust Visual Features without Supervision

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer