DINOv2: Learning Robust Visual Features without Supervision

Huy Vo, Marc Szafraniec, Maxime Oquab, Vasil Khalidov · 2023 · cs.CV · arXiv 2304.07193

Mixed citation behavior. Most common role is background (44%).

792 Pith papers citing it

Background 44% of classified citations

abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

citation-role summary

method 59 · background 57 · baseline 9 · dataset 3 · other 1
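
As a quick check on the 44% figure, the shares can be recomputed from the role counts listed above. This is a minimal sketch that assumes each percentage is taken over the sum of the five role counts (129); the names `ROLE_COUNTS` and `role_shares` are illustrative and not part of any hub tooling.

```python
# Counts copied from the citation-role summary above.
ROLE_COUNTS = {"method": 59, "background": 57, "baseline": 9, "dataset": 3, "other": 1}


def role_shares(counts):
    """Return each role's share of all counted citations, in percent (1 decimal)."""
    total = sum(counts.values())  # assumed denominator: every classified citation
    return {role: round(100 * n / total, 1) for role, n in counts.items()}


if __name__ == "__main__":
    for role, share in role_shares(ROLE_COUNTS).items():
        print(f"{role:<10} {share:5.1f}%")
```

Under that assumption, background comes out at 44.2%, in line with the 44% quoted for it, while method lands slightly higher at 45.7%. The hub may compute its headline figure from a different breakdown, which this sketch does not attempt to model.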

citation-polarity summary

background 57 · use method 57 · baseline 9 · unclear 4 · use dataset 2

authors

Huy Vo, Marc Szafraniec, Maxime Oquab, Théo Moutakanni, Timothée Darcet, Vasil Khalidov

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

eess.IV · 2026-06-07 · unverdicted · novelty 8.0

X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · novelty 8.0

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Show Me Examples: Inferring Visual Concepts from Image Sets

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.

InvSplat: Inverse Feed-Forward Scene Splatting

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.

Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.

From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attack and Detection

cs.CR · 2026-07-01 · unverdicted · novelty 7.0

A systematic survey unifies presentation, digital injection, and GenAI synthesis attacks on identity documents, audits datasets for a reality gap, identifies SDGI in multimodal models, and reports APCER above 25% for top models on synthetic IDs.

Prototype Memory-Guided Training-Free Anomaly Classification and Localization in Prenatal Ultrasound

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.

EPO: Boosting 3D Foundation Models with Edge-based Pose Optimization

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

AnyMatch: Supercharging Universal Multi-Modal Image Matching with Large-Scale Single-View Images

cs.CV · 2026-06-30 · unverdicted · novelty 7.0 · 2 refs

AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.

Complete virtual unwrapping and reading of a rolled Herculaneum papyrus

eess.IV · 2026-06-27 · unverdicted · novelty 7.0

First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$

cs.CV · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.

citing papers explorer

Showing 50 of 792 citing papers.

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation cs.CV · 2026-06-24 · unverdicted · none · ref 30 · internal anchor
MIMFlow uses a VAE on masked images to feed semantic latents to a normalizing flow while a decoder handles high-frequency details, reporting FID 2.50 and 71.3% linear probing on ImageNet 256x256 with 128 tokens.
Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity cs.CV · 2026-06-24 · unverdicted · none · ref 56 · internal anchor
Presents Invoice Haystack benchmark for homogeneous document retrieval and VL-RAG hybrid framework achieving 60% Recall@1 and up to 13.5 point gains over prior methods.
RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer cs.CV · 2026-06-16 · unverdicted · none · ref 20 · internal anchor
RegimeVGGT applies layer-wise U-shaped compression via saliency-guided banded merging and selectively protected K/V downsampling to deliver 6.7x speedup on VGGT at matched reconstruction quality.
SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues cs.CV · 2026-06-15 · unverdicted · none · ref 15 · internal anchor
SierpinskiCam adds Sierpinski dome texture cues and negative-RoPE reference video conditioning to geometry-guided video diffusion to improve camera controllability and consistency in video retaking.
Contrastive Action-Image Pre-training for Visuomotor Control cs.RO · 2026-06-15 · unverdicted · none · ref 4 · internal anchor
CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.
Modality Forcing for Scalable Spatial Generation cs.CV · 2026-06-11 · unverdicted · none · ref 27 · internal anchor
Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers cs.CV · 2026-06-11 · unverdicted · none · ref 100 · internal anchor
HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.
ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection cs.CR · 2026-06-11 · unverdicted · none · ref 19 · internal anchor
ViPER uses a LoRA-adapted ViT-B/14 with dual heads for malware classification and packing detection plus a gating mechanism and weighted losses to reach 0.8521 balanced accuracy on 200k Windows PE images while detecting packing at 0.9949 AUC.
Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning cs.CV · 2026-06-10 · unverdicted · none · ref 32 · internal anchor
DSSA decouples per-frame appearance from temporal identity in slot attention mechanisms to reduce slot swapping and improve temporal consistency in video object segmentation.
Action-Effect Memory Pretraining for Robot Manipulation cs.RO · 2026-06-10 · unverdicted · none · ref 42 · internal anchor
AEM pretrains compact history representations via masked modeling on interleaved vision-action sequences to boost downstream robot manipulation in simulation and real settings.
Tac-DINO: Learning Vision-Tactile Features with Patch Alignment cs.CV · 2026-06-10 · unverdicted · none · ref 158 · internal anchor
Tac-DINO constructs a large tactile dataset and Vis-Tac Holographic Matching Benchmark, then proposes Vision-Tactile Patch Alignment (VTPA) methods that outperform non-aligned baselines on local-to-global feature matching.
LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation cs.CV · 2026-06-10 · unverdicted · none · ref 29 · internal anchor
LASA aggregates multi-layer attention from vision transformers to enable weakly supervised open-vocabulary semantic segmentation on scene sketches, reporting mIoU gains of +3.43 to +15.74 on three benchmarks over prior baselines.
Cross-Modal Benchmarking for Robotic Perception in Natural Environments cs.CV · 2026-06-10 · unverdicted · none · ref 18 · internal anchor
Presents the WildCross benchmark with 476K frames for place recognition and metric depth estimation in natural environments, demonstrating limitations of existing vision models.
TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation cs.RO · 2026-06-09 · unverdicted · none · ref 36 · internal anchor
TacForeSight trains a force-conditioned tactile world model to predict latent dynamics and uses those predictions as anticipatory priors inside a visuo-tactile policy for real-time contact-rich manipulation.
Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving cs.CV · 2026-06-09 · unverdicted · none · ref 55 · internal anchor
Envision4D presents a feed-forward 4D Gaussian Splatting framework with future pose prediction, temporal attention, and conditioned motion lifting for pose-free extrapolation in autonomous driving scenes.
Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning cs.LG · 2026-06-09 · unverdicted · none · ref 16 · internal anchor
BFQ enables single-step noise-to-action mapping in offline RL by dividing flow-path displacements into bootstrappable short-range components learned from marginal velocity.
Popcorn: A Configurable Benchmark for Visual Evidence in Multimodal Movie Recommendation cs.IR · 2026-06-08 · unverdicted · none · ref 11 · internal anchor
Popcorn is a new benchmark standardizing modality assembly, fusion, and evaluation of thumbnails, trailers, and full movies encoded by VLMs for multimodal movie recommendation.
See More, Match Better: Multi-Source Feature Fusion for Two-View Correspondence Learning cs.CV · 2026-06-08 · unverdicted · none · ref 18 · internal anchor
TriMatch fuses geometric, texture semantic, and structural semantic features via dedicated alignment and modulation modules to improve inlier-outlier discrimination in two-view correspondence learning.
G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation cs.CV · 2026-06-06 · unverdicted · none · ref 26 · internal anchor
G2G attaches three small trainable modules to frozen backbones and reports state-of-the-art inter-group pose accuracy on four datasets spanning simulation, real cross-season, and sim-to-real transfer using only relative-pose supervision.
DALE-CT: Depth-Aware Foundation Models for Computed Tomography cs.CV · 2026-06-05 · unverdicted · none · ref 18 · internal anchor
DALE-CT, a 2D LeJEPA model with depth-aware dual supervision, reaches 0.833 Macro AUROC on multi-abnormality detection in CT and approaches 3D SOTA performance using less data and no textual supervision.
LARA: Latent Action Representation Alignment for Vision-Language-Action Models cs.CV · 2026-06-05 · unverdicted · none · ref 14 · 2 links · internal anchor
LARA jointly optimizes LAM and VLA models via representation alignment to improve robotic manipulation performance using human videos.
ForensicConcept: Transferable Forensic Concepts for AIGI Detection cs.CV · 2026-06-05 · unverdicted · none · ref 50 · internal anchor
ForensicConcept extracts and transfers forensic concepts from AIGI detectors via Transformer attribution, concept codebooks, CleanDIFT references, and CKNNA alignment to improve detection on unseen generators.
DaX: Learning General Pathology Representations Across Scales eess.IV · 2026-06-05 · unverdicted · none · ref 13 · internal anchor
DaX is a pathology vision foundation model that extends DINOv3 with continuous magnification training and cross-scale consistency, achieving top average performance on a benchmark of 161 tasks from 44 datasets covering 28k patients.
Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy cs.CV · 2026-06-05 · unverdicted · none · ref 9 · internal anchor
DirectAnimator bypasses pose extraction using a Driving Cue Triplet and Same2X training strategy to achieve state-of-the-art human animation quality and robustness from raw videos.
Meridian: Metric-Semantic Primitive Matching for Cross-View Geo-Localization Beyond Urban Environments cs.RO · 2026-06-04 · unverdicted · none · ref 12 · internal anchor
Meridian matches metric-semantic primitives across aerial and ground views for training-free global localization in diverse natural environments, reporting 2.4 m average trajectory error over 19 km.
Geometry-Aware Dataset Condensation for Diffusion Model Training cs.CV · 2026-06-04 · unverdicted · none · ref 56 · internal anchor
A geometry-aware dataset condensation technique reformulates subset selection as one-sided partial optimal transport alignment plus regularization to improve diffusion model training fidelity.
X4Val: Learning Neural Surrogates for Variance-Reduced Policy Evaluation cs.RO · 2026-06-03 · unverdicted · none · ref 21 · internal anchor
X4Val learns transferable neural predictors from non-paired multi-domain data and incorporates them into control-variates estimators to reduce variance in real-world robotic policy evaluation by up to 38.4%.
TransTac: Visuo-Tactile Modality Transition via Ultraviolet-Encoded Transparent Elastomers cs.RO · 2026-06-03 · unverdicted · none · ref 30 · internal anchor
TransTac is a transparent UV-encoded binocular vision-based tactile sensor that integrates visual and marker-based tactile reconstruction, achieving 83.3% zero-shot recognition accuracy and stronger cross-modal alignment than opaque baselines.
Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models cs.CV · 2026-06-03 · unverdicted · none · ref 2 · internal anchor
GPUA learns an orthogonal mapping from VFM to VLM feature space to preserve geometry and improve cross-model compatibility for zero-shot recognition and segmentation.
KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models cs.LG · 2026-06-02 · unverdicted · none · ref 4 · internal anchor
KODA uses modality-wise kernel composition and constrained optimization to discover interpretable discrepancy structures between vision-language representations.
Beyond Compression: Quantifying Spectral Accessibility in Vision Representations cs.CV · 2026-06-02 · unverdicted · none · ref 10 · internal anchor
Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.
PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization cs.CV · 2026-06-02 · unverdicted · none · ref 8 · internal anchor
PRISM is a two-stage MoE framework that achieves new state-of-the-art results on PASCAL-Context and NYUD-v2 by enabling self-organized expert specialization across diverse vision foundation models.
GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations cs.CV · 2026-06-02 · unverdicted · none · ref 41 · internal anchor
GLINT introduces sparsely gated alignment and dense feature regularization on top of DINOv3 and V-JEPA encoders to enable query-specific zero-shot grounding and segmentation in 2D CXR and 3D CT.
DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation cs.CV · 2026-06-02 · unverdicted · none · ref 35 · internal anchor
DOME learns sample-specific domain variables from sparse supervision via vision-language models and a sparse domain bank to improve test-time adaptation performance.
BEAST3D: Animal behavioral analysis and neural encoding from multi-view video via Gaussian splatting q-bio.NC · 2026-06-01 · unverdicted · none · ref 29 · internal anchor
BEAST3D learns viewpoint-invariant 3D features from calibrated multi-view animal videos via Gaussian splatting for novel view synthesis, pose estimation, and neural encoding across four species.
MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents cs.CV · 2026-06-01 · unverdicted · none · ref 31 · internal anchor
MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.
FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds cs.CV · 2026-06-01 · unverdicted · none · ref 3 · internal anchor
FlatVPR adds a learnable residual adapter and a curvature-minimizing loss to foundation-model features so that descriptors between distant anchors can be reconstructed by linear interpolation, improving VPR on NCLT at 100 m spacing.
DeblurNVS: Geometric Latent Diffusion for Novel View Synthesis from Sparse Motion-Blurred Images cs.CV · 2026-05-31 · unverdicted · none · ref 46 · internal anchor
DeblurNVS restores geometric representations via latent diffusion to enable high-fidelity novel view synthesis directly from sparse motion-blurred inputs.
Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs cs.RO · 2026-05-31 · unverdicted · none · ref 40 · internal anchor
Dynamic scene graphs serve as explicit memory to improve imitation learning policies for spatial-temporal reasoning under partial observability in mobile and tabletop manipulation.
CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery cs.CV · 2026-05-30 · unverdicted · none · ref 34 · internal anchor
CAFOSat is a new strongly annotated remote-sensing dataset for CAFO mapping that uses human-in-the-loop refinement and curated negatives, with benchmarks on CNNs, transformers, and vision-language models plus a synthetic augmentation pipeline.
Spatial Transcriptomics-Guided Alignment Enhances Molecular Profiling in Pathology Foundation Model cs.LG · 2026-05-29 · unverdicted · none · ref 13 · internal anchor
STAMP uses a curated 1.8M-pair spatial transcriptomics atlas and pathway-informed alignment to augment pathology foundation models for molecular phenotype inference from H&E WSIs.
HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model cs.RO · 2026-05-29 · unverdicted · none · ref 40 · internal anchor
HARP aligns human-robot visual and latent action representations via paired bridges and unpaired dynamics supervision to boost VLA policy performance on manipulation tasks.
VLM3: Vision Language Models Are Native 3D Learners cs.CV · 2026-05-28 · unverdicted · none · ref 12 · internal anchor
Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
Fairness Beyond Demographics: Optimizing Performance Across Appearance-Based Hidden Cohorts in Medical Imaging cs.CV · 2026-05-28 · unverdicted · none · ref 26 · internal anchor
LHCF trains medical image models for fairness by optimizing across latent appearance-based cohorts discovered via clustering, achieving SOTA results on single and multiple demographic attributes without using any demographic labels.
Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification cs.CV · 2026-05-27 · unverdicted · none · ref 51 · internal anchor
Introduces VIP identification task, releases Temporal-VIP dataset, and presents VIP-Net framework that achieves 67.3% accuracy on identifying important persons in videos while providing rationale similarity of 0.63.
Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation cs.CV · 2026-05-27 · unverdicted · none · ref 31 · internal anchor
DeGO decouples rigid and nonrigid motion in Gaussian occupancy prediction via factorized 4D distillation from VGGT, reporting SOTA results on Occ3D-NuScenes with 13.5% gains on human-centric cases.
Turning Video Models into Generalist Robot Policies cs.RO · 2026-05-27 · unverdicted · none · ref 53 · internal anchor
Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.
Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data cs.RO · 2026-05-26 · unverdicted · none · ref 41 · internal anchor
Trinity is a unified transformer that performs both class-specific semantic segmentation and class-agnostic terrain segmentation, trained on synthetic RUGDSynth data and evaluated on the new EXTerra real-world dataset.
Representation-Conditioned Diffusion Models for Guided Training Data Generation cs.CV · 2026-05-26 · unverdicted · none · ref 11 · internal anchor
Representation-conditioned diffusion models generate synthetic ImageNet data that trains classifiers to higher top-1 accuracy than class-conditioned generation (+10.76 pp) or real data (+2.0 pp when scaled).
Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery cs.CV · 2026-05-25 · unverdicted · none · ref 26 · internal anchor
A Perceiver IO fusion architecture combines satellite and street-level imagery via DINOv2 tokens and RGB-M masking to classify roof attributes on a new dataset of 32,135 buildings across ten countries.
