super hub Mixed citations

DINOv2: Learning Robust Visual Features without Supervision

Huy Vo, Marc Szafraniec, Maxime Oquab, Vasil Khalidov · 2023 · cs.CV · arXiv 2304.07193

Mixed citation behavior. Most common role is background (44%).

819 Pith papers citing it

Background 44% of classified citations

open full Pith review browse 819 citing papers more from Huy Vo arXiv PDF

abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 59 background 57 baseline 9 dataset 3 other 1

citation-polarity summary

background 57 use method 57 baseline 9 unclear 4 use dataset 2

claims ledger

abstract The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques

authors

Huy Vo Marc Szafraniec Maxime Oquab Th\'eo Moutakanni Timoth\'ee Darcet Vasil Khalidov

co-cited works

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

WildBox: A Dataset and Benchmark for Aerial Monocular 3D Detection of African Savanna Wildlife

cs.CV · 2026-06-19 · unverdicted · novelty 8.0

WildBox provides over 237k 3D wildlife annotations from drone video and benchmarks reveal zero-shot 3D detection at 0 AP but fine-tuned performance of 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

eess.IV · 2026-06-07 · unverdicted · novelty 8.0

X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · novelty 8.0

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Show Me Examples: Inferring Visual Concepts from Image Sets

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.

InvSplat: Inverse Feed-Forward Scene Splatting

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.

Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

The subspace intervention framework reveals that pre-training objectives shape how ViTs encode geometric information in compressible low-rank subspaces, with peak precision at intermediate layers.

From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attack and Detection

cs.CR · 2026-07-01 · unverdicted · novelty 7.0

A systematic survey unifies presentation, digital injection, and GenAI synthesis attacks on identity documents, audits datasets for a reality gap, identifies SDGI in multimodal models, and reports APCER above 25% for top models on synthetic IDs.

Prototype Memory-Guided Training-Free Anomaly Classification and Localization in Prenatal Ultrasound

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.

EPO: Boosting 3D Foundation Models with Edge-based Pose Optimization

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

AnyMatch: Supercharging Universal Multi-Modal Image Matching with Large-Scale Single-View Images

cs.CV · 2026-06-30 · unverdicted · novelty 7.0 · 2 refs

AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.

Complete virtual unwrapping and reading of a rolled Herculaneum papyrus

eess.IV · 2026-06-27 · unverdicted · novelty 7.0

First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

citing papers explorer

Showing 50 of 819 citing papers.

Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation cs.RO · 2026-05-02 · unverdicted · none · ref 17 · internal anchor
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
InpaintSLat: Inpainting Structured 3D Latents via Initial Noise Optimization cs.CV · 2026-05-01 · unverdicted · none · ref 14 · internal anchor
Optimizing initial noise via backpropagation approximation and spectral parameterization in structured 3D latent diffusion yields higher contextual consistency and prompt alignment in training-free inpainting.
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer cs.CV · 2026-05-01 · unverdicted · none · ref 24 · internal anchor
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
MeshReGen: A Unified 3D Geometry Regeneration Framework cs.CV · 2026-04-30 · unverdicted · none · ref 43 · 2 links · internal anchor
MeshReGen introduces a conditioned 3D geometry regenerator with VecSet that learns a regeneration prior via self-supervision and reports state-of-the-art results on controllable generation tasks.
Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising physics.geo-ph · 2026-04-30 · conditional · none · ref 58 · internal anchor
Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.
REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement cs.CV · 2026-04-30 · unverdicted · none · ref 32 · internal anchor
REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.
Sparse-View 3D Gaussian Splatting in the Wild cs.CV · 2026-04-30 · unverdicted · none · ref 35 · internal anchor
A new sparse-view 3D Gaussian splatting method for unconstrained scenes with distractors combines diffusion-based reference-guided refinement and sparsity-aware Gaussian replication to achieve better rendering quality.
Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations cs.CV · 2026-04-29 · unverdicted · none · ref 60 · internal anchor
RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape quality and pose accuracy while using 80% fewer meshes.
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners cs.CV · 2026-04-29 · unverdicted · none · ref 2 · internal anchor
LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.
Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models cs.CV · 2026-04-29 · unverdicted · none · ref 13 · internal anchor
Three frameworks adapt foundation models for generalized category discovery under domain shifts via disentanglement and prompt tuning, showing gains on synthetic and real multi-domain data.
Beyond Fidelity: Semantic Similarity Assessment in Low-Level Image Processing cs.CV · 2026-04-28 · unverdicted · none · ref 30 · internal anchor
T3S is a new semantic similarity score for processed images that decomposes semantics into foreground entities, background entities, and relations, outperforming fidelity metrics on COCO and SPA-Data.
SS3D: End2End Self-Supervised 3D from Web Videos cs.CV · 2026-04-24 · unverdicted · none · ref 33 · 3 links · internal anchor
SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.
Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction? eess.IV · 2026-04-24 · unverdicted · none · ref 27 · internal anchor
Natural-domain foundation models provide competitive and more robust priors than task-specific models for accelerated cardiac MRI reconstruction in cross-domain settings.
Text-Guided Multimodal Unified Industrial Anomaly Detection cs.CV · 2026-04-24 · unverdicted · none · ref 32 · internal anchor
A text-semantics-guided multimodal framework with geometry-aware mapping and object-conditioned text adaptation achieves state-of-the-art unsupervised anomaly detection and localization on RGB-3D industrial datasets while enabling a single model for multiple classes.
Unlocking Optical Prior: Spectrum-Guided Knowledge Transfer for SAR Generalized Category Discovery cs.CV · 2026-04-24 · unverdicted · none · ref 59 · internal anchor
A frequency-domain modeling approach transfers optical priors to SAR imagery via paired pre-training, enabling state-of-the-art generalized category discovery on SAR data.
Euclid Quick Data Release (Q1). AstroVink: A vision transformer approach to find strong gravitational lens systems astro-ph.IM · 2026-04-23 · conditional · none · ref 51 · internal anchor
A vision transformer classifier trained on simulated and real Euclid data recovers all known strong lenses in test sets and finds 8 Grade A plus 26 Grade B new candidates in the Q1 data.
DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures cs.CV · 2026-04-23 · unverdicted · none · ref 17 · internal anchor
DualSplat bootstraps object-level pseudo-masks from initial 3DGS reconstruction failures using residuals and SAM2 to enable robust second-pass optimization in transient-heavy scenes.
Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers cs.CV · 2026-04-23 · unverdicted · none · ref 24 · internal anchor
Sculpt4D generates temporally coherent 4D shapes by integrating a block sparse attention mechanism with time-decaying mask into a pretrained 3D diffusion transformer, achieving SOTA results with 56% less computation.
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation cs.CV · 2026-04-23 · unverdicted · none · ref 24 · internal anchor
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models cs.CV · 2026-04-22 · unverdicted · none · ref 48 · internal anchor
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
Self-supervised pretraining for an iterative image size agnostic vision transformer cs.CV · 2026-04-22 · unverdicted · none · ref 41 · internal anchor
A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling cs.RO · 2026-04-21 · unverdicted · none · ref 35 · internal anchor
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
ReCap: Lightweight Referential Grounding for Coherent Story Visualization cs.CV · 2026-04-20 · unverdicted · none · ref 29 · internal anchor
ReCap improves character consistency in story visualization by 2.63% on FlintstonesSV and 5.65% on PororoSV using a selective pronoun-based conditioning module and training-only semantic drift correction.
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation cs.RO · 2026-04-20 · unverdicted · none · ref 31 · internal anchor
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
HSG: Hyperbolic Scene Graph cs.CV · 2026-04-19 · unverdicted · none · ref 40 · internal anchor
Hyperbolic Scene Graph (HSG) learns embeddings in hyperbolic space for better hierarchical structure in scene graphs, achieving graph IoU of 33.51 versus 25.37 for the best Euclidean baseline.
Instant Colorization of Gaussian Splats cs.CV · 2026-04-18 · unverdicted · none · ref 44 · internal anchor
Instant colorization of Gaussian splats is achieved via per-Gaussian visibility-weighted least squares solved with the normal equation for up to 10x speedup over gradient descent.
One-Shot Cross-Geometry Skill Transfer through Part Decomposition cs.RO · 2026-04-16 · unverdicted · none · ref 17 · internal anchor
Part decomposition with generative shape models allows one-shot robot skill transfer across unfamiliar object geometries in simulation and real settings.
UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception cs.RO · 2026-04-15 · unverdicted · none · ref 21 · internal anchor
UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 215 · internal anchor
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation cs.CV · 2026-04-15 · conditional · none · ref 48 · internal anchor
SceneGlue introduces a scene-aware feature matching framework using implicit parallel attention and an explicit Visibility Transformer, trained solely on local feature matches to enhance cross-view correspondence without scene annotations.
Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data cs.CV · 2026-04-15 · unverdicted · none · ref 52 · internal anchor
BVE framework enables text-guided 3D editing beyond voxel limits by combining self-constructed data, lightweight semantic injection, and annotation-free masking to preserve local invariance.
What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering cs.CV · 2026-04-15 · unverdicted · none · ref 7 · internal anchor
Supervised dataset discrimination overstates semantic bias via resolution artifacts; unsupervised clustering of foundation-model features on web-scale datasets yields near-chance separability.
Boosting Visual Instruction Tuning with Self-Supervised Guidance cs.CV · 2026-04-14 · unverdicted · none · ref 58 · internal anchor
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
Pi-HOC: Pairwise 3D Human-Object Contact Estimation cs.CV · 2026-04-14 · unverdicted · none · ref 23 · 2 links · internal anchor
Pi-HOC predicts dense 3D semantic contacts for all human-object pairs in an image via instance-aware tokens and an InteractionFormer, achieving higher accuracy and 20x throughput than prior methods.
Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models cs.CV · 2026-04-14 · unverdicted · none · ref 28 · internal anchor
Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
Cross-Attentive Multiview Fusion of Vision-Language Embeddings cs.CV · 2026-04-14 · unverdicted · none · ref 23 · internal anchor
CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.
ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models cs.CV · 2026-04-14 · unverdicted · none · ref 13 · internal anchor
ArtifactWorld restores artifacts in 3D Gaussian Splatting by training a video diffusion backbone on 107.5K paired clips with an isomorphic predictor for artifact heatmaps and an Artifact-Aware Triplet Fusion mechanism to achieve better sparse-view novel synthesis.
VVGT: Visual Volume-Grounded Transformer cs.GR · 2026-04-14 · unverdicted · none · ref 19 · internal anchor
VVGT is a dual-transformer network with Volume Geometry Forcing that maps volumetric data to 3D Gaussian primitives for accurate ray-based rendering without per-scene optimization.
Learning Long-term Motion Embeddings for Efficient Kinematics Generation cs.CV · 2026-04-13 · unverdicted · none · ref 31 · internal anchor
A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
Continuous Adversarial Flow Models cs.LG · 2026-04-13 · unverdicted · none · ref 55 · internal anchor
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks cs.CV · 2026-04-13 · unverdicted · none · ref 19 · internal anchor
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
Self-supervised Pretraining of Cell Segmentation Models cs.CV · 2026-04-12 · unverdicted · none · ref 14 · internal anchor
DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.
Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance cs.CV · 2026-04-12 · unverdicted · none · ref 34 · internal anchor
DCP-PD improves macro F1 scores on CT report generation benchmarks and introduces a hierarchical location-aware evaluation protocol that reveals ongoing challenges in pathology spatial grounding.
Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis cs.CV · 2026-04-11 · unverdicted · none · ref 5 · internal anchor
Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visual question answering.
ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos cs.CV · 2026-04-10 · accept · none · ref 27 · internal anchor
ACCIDENT is a new benchmark with 2,027 real and 2,211 synthetic annotated video clips for temporal localization, spatial localization, and collision type classification of vehicle accidents in CCTV footage.
V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation cs.RO · 2026-04-10 · unverdicted · none · ref 17 · internal anchor
V-CAGE automates the creation of scalable, high-quality robotic manipulation datasets through context-aware scene construction, closed-loop visual verification, and perceptually-driven compression.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding cs.LG · 2026-04-09 · unverdicted · none · ref 79 · internal anchor
A meta-optimized in-context learning approach enables training-free cross-subject semantic visual decoding from fMRI by inferring individual neural encoding patterns via hierarchical inference on a few examples.
Self-Improving 4D Perception via Self-Distillation cs.CV · 2026-04-09 · unverdicted · none · ref 43 · internal anchor
SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight benchmarks.
TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders cs.CV · 2026-04-08 · unverdicted · none · ref 7 · internal anchor
TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.
GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control cs.LG · 2026-04-08 · unverdicted · none · ref 6 · internal anchor
GIRL reduces latent rollout drift by 38-61% versus DreamerV3 in MBRL by grounding transitions with DINOv2 embeddings and using an information-theoretic adaptive bottleneck, yielding better long-horizon returns on control benchmarks.

DINOv2: Learning Robust Visual Features without Supervision

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer