DINOv2: Learning Robust Visual Features without Supervision

Huy Vo, Marc Szafraniec, Maxime Oquab, Vasil Khalidov · 2023 · cs.CV · arXiv 2304.07193

Mixed citation behavior. The most common role is background (44% of classified citations).

815 Pith papers cite it.

abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

citation-role summary

method 59 · background 57 · baseline 9 · dataset 3 · other 1
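
As a rough check on the percentage quoted for the background role, the sketch below recomputes each role's share from the counts listed above. It assumes the denominator is simply the sum of the listed role counts; the page does not state how its percentage is derived, and `ROLE_COUNTS` and `role_shares` are illustrative names, not part of any Pith API.

```python
# Counts copied from the citation-role summary above.
ROLE_COUNTS = {"method": 59, "background": 57, "baseline": 9, "dataset": 3, "other": 1}


def role_shares(counts):
    """Return each role's share of the classified citations, as a whole percent.

    Assumption: the denominator is the sum of the listed counts.
    """
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(ROLE_COUNTS)
print(sum(ROLE_COUNTS.values()))  # 129 classified citations
print(shares["background"])       # 44
```

Under this assumption, the 129 classified citations are a subset of the 815 citing papers, so the percentage describes only the classified subset.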

citation-polarity summary

background 57 · use method 57 · baseline 9 · unclear 4 · use dataset 2

authors

Huy Vo, Marc Szafraniec, Maxime Oquab, Théo Moutakanni, Timothée Darcet, Vasil Khalidov

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

WildBox: A Dataset and Benchmark for Aerial Monocular 3D Detection of African Savanna Wildlife

cs.CV · 2026-06-19 · unverdicted · novelty 8.0

WildBox provides over 237k 3D wildlife annotations from drone video. Its benchmarks show zero-shot 3D detection at 0 AP and fine-tuned performance of 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

eess.IV · 2026-06-07 · unverdicted · novelty 8.0

X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · novelty 8.0

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Show Me Examples: Inferring Visual Concepts from Image Sets

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.

InvSplat: Inverse Feed-Forward Scene Splatting

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.

Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.

From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attack and Detection

cs.CR · 2026-07-01 · unverdicted · novelty 7.0

A systematic survey unifies presentation, digital injection, and GenAI synthesis attacks on identity documents, audits datasets for a reality gap, identifies SDGI in multimodal models, and reports APCER above 25% for top models on synthetic IDs.

Prototype Memory-Guided Training-Free Anomaly Classification and Localization in Prenatal Ultrasound

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.

EPO: Boosting 3D Foundation Models with Edge-based Pose Optimization

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

AnyMatch: Supercharging Universal Multi-Modal Image Matching with Large-Scale Single-View Images

cs.CV · 2026-06-30 · unverdicted · novelty 7.0 · 2 refs

AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

The paper introduces a new dataset of 220k+ cross-view pairs and GAGeo, a single-stage geometry-aware model based on the π³ 3D foundation model. GAGeo outperforms prior methods on object geo-localization, with strong generalization and zero-shot ground-to-drone capability.

Complete virtual unwrapping and reading of a rolled Herculaneum papyrus

eess.IV · 2026-06-27 · unverdicted · novelty 7.0

First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

citing papers explorer

Showing 50 of 815 citing papers.

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks cs.CV · 2026-05-16 · unverdicted · none · ref 25 · internal anchor
A label-free metric-guided fusion of complementary features from visual foundation models yields consistent gains in dense prediction tasks with improved object semantics and boundary localization.
LACE: Latent Visual Representation for Cross-Embodiment Learning cs.RO · 2026-05-16 · unverdicted · none · ref 38 · internal anchor
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
GeoWorld-VLM: Geometry from World Models for Vision-Language Models cs.CV · 2026-05-15 · unverdicted · none · ref 32 · 2 links · internal anchor
GeoWorld-VLM aligns VLM image features with intermediate representations from camera-conditioned world models via fine-tuning only the encoder and projector, yielding ~4% gains on What'sUp and VSR spatial benchmarks across two VLM backbones.
Registers Matter for Pixel-Space Diffusion Transformers cs.CV · 2026-05-15 · unverdicted · none · ref 6 · internal anchor
Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer cs.CV · 2026-05-15 · unverdicted · none · ref 6 · 2 links · internal anchor
FGQ applies diagonal Fisher information to guide learnable affine transformations in PTQ for multi-task VGGT, yielding up to 39% relative gains over baselines at 4-bit quantization.
DiLA: Disentangled Latent Action World Models cs.CV · 2026-05-15 · unverdicted · none · ref 22 · internal anchor
DiLA uses content-structure disentanglement driven by predictive bottlenecks to create semantically structured latent actions for high-fidelity video world models.
Efficient Image Synthesis with Sphere Latent Encoder cs.CV · 2026-05-15 · unverdicted · none · ref 28 · internal anchor
Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.
Aligning Latent Geometry for Spherical Flow Matching in Image Generation cs.CV · 2026-05-14 · unverdicted · none · ref 31 · internal anchor
Projecting VAE latents to a fixed spherical radius and replacing linear interpolation with spherical linear interpolation improves class-conditional ImageNet-256 FID while leaving the diffusion architecture unchanged.
Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image cs.CV · 2026-05-14 · unverdicted · none · ref 70 · internal anchor
Sat3DGen improves geometric RMSE from 6.76m to 5.20m and FID from ~40 to 19 for street-level 3D generation from satellite images via geometry-centric constraints and perspective training.
H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors cs.CV · 2026-05-14 · unverdicted · none · ref 38 · internal anchor
H-OmniStereo trains a stereo matcher on 2.8 million synthetic equirectangular pairs and adds a heading-aligned normal prior to improve zero-shot accuracy and generalization on out-of-domain and real omnidirectional data.
TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation cs.CV · 2026-05-14 · unverdicted · none · ref 112 · internal anchor
TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction cs.CV · 2026-05-14 · unverdicted · none · ref 72 · internal anchor
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation cs.CV · 2026-05-14 · unverdicted · none · ref 9 · internal anchor
A hierarchical VFM-driven knowledge distillation method with semantic-conditioned modulation and cluster reweighting stabilizes point-supervised infrared small target detection and improves accuracy.
VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence cs.CV · 2026-05-13 · unverdicted · none · ref 14 · internal anchor
VoxCor creates reusable volumetric features from frozen 2D ViT models by combining triplanar inference with a closed-form weighted partial least squares projection, enabling direct voxel correspondence across modalities without training or registration.
Qwen-Image-VAE-2.0 Technical Report cs.CV · 2026-05-13 · unverdicted · none · ref 13 · internal anchor
Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.
Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics cs.CV · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
Knowledge of the ViT backbone alone enables highly effective gray-box adversarial attacks on synthetic image detectors, often nearing white-box performance.
Contrastive-SDXL: Annotation-Preserving Night-Time Augmentation for Pedestrian Detection cs.CV · 2026-05-13 · unverdicted · none · ref 6 · internal anchor
Contrastive-SDXL augments daytime images into realistic night-time versions using SDXL-Turbo with LoRA and multi-level DINOv2 contrastive losses, yielding 6-7% lower miss rate on pedestrian detection versus daytime-only training.
Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis cs.CV · 2026-05-13 · unverdicted · none · ref 15 · 2 links · internal anchor
Anatomy-Slot decomposes patch tokens into emergent anatomical slots and aligns them across eyes via bidirectional cross-attention, reporting a 4.2 AUC gain over ViT-L on ODIR-5K.
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis cs.CV · 2026-05-12 · unverdicted · none · ref 14 · internal anchor
CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 135 · internal anchor
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
From Image Hashing to Scene Change Detection cs.CV · 2026-05-12 · unverdicted · none · ref 23 · internal anchor
HashSCD is a patch-wise hashing method for unsupervised scene change detection and localization that operates directly in Hamming space with competitive performance and lower computational cost.
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning cs.RO · 2026-05-12 · unverdicted · none · ref 64 · internal anchor
TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone cs.LG · 2026-05-12 · conditional · none · ref 28 · 2 links · internal anchor
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs cs.CV · 2026-05-11 · unverdicted · none · ref 28 · internal anchor
Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation with no minority examples in training.
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 39 · internal anchor
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth cs.CV · 2026-05-11 · unverdicted · none · ref 10 · 4 links · internal anchor
GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition cs.CV · 2026-05-11 · unverdicted · none · ref 35 · 2 links · internal anchor
Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation cs.CV · 2026-05-11 · unverdicted · none · ref 16 · internal anchor
Hystar adapts CLIP-like models to unseen query styles by generating per-input singular-value perturbations with a hypernetwork for attention layers and a new StyleNCE contrastive loss.
DiffATS: Diffusion in Aligned Tensor Space cs.LG · 2026-05-10 · unverdicted · none · ref 40 · internal anchor
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.
Event Fields: Learning Latent Event Structure for Waveform Foundation Models cs.LG · 2026-05-09 · unverdicted · none · ref 24 · internal anchor
Event-centric waveform foundation models are learned via self-supervised consistency on latent event structures and interactions, yielding improved performance and label efficiency over sequence-based baselines on physiological tasks.
Generative 3D Gaussians with Learned Density Control cs.GR · 2026-05-08 · unverdicted · none · ref 33 · internal anchor
DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation cs.CV · 2026-05-08 · unverdicted · none · ref 20 · internal anchor
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
Rethinking Dense Optical Flow without Test-Time Scaling cs.CV · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
Dense optical flow can be estimated accurately in one forward pass by combining DINO-v2 semantic priors and monocular depth geometric cues via global matching, reaching 2.81 EPE on Sintel Final without any refinement.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 57 · internal anchor
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing cs.CV · 2026-05-08 · conditional · none · ref 26 · internal anchor
Semantic pseudo-pairing via DINOv2 embeddings and fused Gromov-Wasserstein optimal transport enables training a 7K-parameter CNN for unpaired smartphone ISP, achieving 22.569 PSNR on the NTIRE 2026 challenge test set.
Predictive but Not Plannable: RC-aux for Latent World Models cs.LG · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
InfoGeo: Information-Theoretic Object-Centric Learning for Cross-View Generalizable UAV Geo-Localization cs.CV · 2026-05-08 · unverdicted · none · ref 42 · internal anchor
InfoGeo reformulates cross-view geo-localization as an information bottleneck that aligns object-centric structural relations across views while suppressing view-specific noise.
ProCompNav: Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries cs.AI · 2026-05-07 · unverdicted · none · ref 25 · 3 links · internal anchor
ProCompNav builds a candidate pool from ambiguous queries then uses pool-splitting binary questions for disambiguation, improving success rate and shortening responses on CoIN-Bench and TextNav.
Learning a Delighting Prior for Facial Appearance Capture in the Wild cs.CV · 2026-05-07 · unverdicted · none · ref 170 · internal anchor
A delighting network trained via Dataset Latent Modulation on heterogeneous OLAT and Light Stage data enables high-quality in-the-wild facial reflectance capture from video and produces the NeRSemble-Scan dataset.
Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping cs.CV · 2026-05-07 · conditional · none · ref 46 · internal anchor
Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.
Taming Outlier Tokens in Diffusion Transformers cs.CV · 2026-05-06 · unverdicted · none · ref 18 · internal anchor
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 55 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring cs.CV · 2026-05-06 · unverdicted · none · ref 14 · internal anchor
DART is a cross-modal foundation model that delivers rope damage classification, severity regression, and few-shot recognition from a single frozen representation trained on 4270 images across 14 damage classes.
Reference-based Category Discovery: Unsupervised Object Detection with Category Awareness cs.CV · 2026-05-06 · unverdicted · none · ref 7 · internal anchor
RefCD enables unsupervised category-aware object detection by using feature similarity between predicted objects and unlabeled reference images to guide category learning.
Intermediate Representations are Strong AI-Generated Image Detectors cs.CV · 2026-05-05 · unverdicted · none · ref 40 · internal anchor
Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.
Stylistic Attribute Control in Latent Diffusion Models cs.CV · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation cs.RO · 2026-05-02 · unverdicted · none · ref 17 · internal anchor
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
InpaintSLat: Inpainting Structured 3D Latents via Initial Noise Optimization cs.CV · 2026-05-01 · unverdicted · none · ref 14 · internal anchor
Optimizing initial noise via backpropagation approximation and spectral parameterization in structured 3D latent diffusion yields higher contextual consistency and prompt alignment in training-free inpainting.
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer cs.CV · 2026-05-01 · unverdicted · none · ref 24 · internal anchor
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
MeshReGen: A Unified 3D Geometry Regeneration Framework cs.CV · 2026-04-30 · unverdicted · none · ref 43 · 2 links · internal anchor
MeshReGen introduces a conditioned 3D geometry regenerator with VecSet that learns a regeneration prior via self-supervision and reports state-of-the-art results on controllable generation tasks.
