super hub Mixed citations

DINOv2: Learning Robust Visual Features without Supervision

Huy Vo, Marc Szafraniec, Maxime Oquab, Vasil Khalidov · 2023 · cs.CV · arXiv 2304.07193

Mixed citation behavior. Most common role is background (44%).

588 Pith papers citing it

Background 44% of classified citations


abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.


citation-role summary

method 59 · background 57 · baseline 9 · dataset 3 · other 1

citation-polarity summary

background 57 · use method 57 · baseline 9 · unclear 4 · use dataset 2


authors

Huy Vo, Marc Szafraniec, Maxime Oquab, Théo Moutakanni, Timothée Darcet, Vasil Khalidov

representative citing papers

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy

astro-ph.IM · 2026-06-16 · unverdicted · novelty 7.0

A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

How Neural Losses Shape VAE Latents

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

A 3D-aware framework uses SAM3D geometry and pose estimation plus geodesic filtering to supervise a lightweight adapter on DINO and Stable Diffusion features, improving semantic correspondence with less manual supervision.

FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

FRUC enables one-shot calibration-free dynamic scene reconstruction from collaborative driving views via a geometric Transformer, ego-centric occlusion priors, and zero-initialized residual denoising, claiming SOTA quality and speed on V2XReal and UrbanIng-V2X.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.

EventGait: Towards Robust Gait Recognition with Event Streams

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

EventGait is a dual-stream spiking and cross-modal framework for event-based gait recognition that matches or exceeds RGB methods in normal conditions and significantly outperforms them in low light, supported by new synthetic event gait benchmarks.

Seeing Through Fog: Towards Fog-Invariant Action Recognition

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Introduces FogAct paired clean-foggy video dataset and FogNet two-stream CLIP model that learns fog-invariant semantic representations via clean-video guidance.

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes weighted aggregation of clusters and self-distillation-driven token pruning to improve both accuracy and efficiency in ViT-based visual place recognition.

Capability ≠ Interpretability: Human Interpretability of Vision Foundation Models

cs.CV · 2026-05-19 · conditional · novelty 7.0

Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

PrAda adapts text-prompted segmentation models in a few-shot setting by learning and fusing class-specific prototypes from fine-grained and high-level features, yielding significant gains on semantic, instance, and panoptic segmentation across five benchmarks.

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

cs.CV · 2026-05-19 · accept · novelty 7.0

Releases DTE-aerial-train (385K patches) and DTE-aerial-bench (25 global orthoimages) as the first harmonized multi-resolution datasets for joint tree cover and mortality segmentation across biomes.

CineMatte: Background Matting for Virtual Production and Beyond

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

CineMatte uses a cross-attention design on a Siamese DINOv3 ViT plus a pretrained upsampler to produce robust mattes for virtual production, backed by a new non-synthetic 4K VP dataset that supports camera motion.

Best Segmentation Buddies for Image-Shape Correspondence

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

The work defines Best Segmentation Buddies as vertices on a 3D shape whose nearest image pixel under distilled features falls inside a given 2D segment, then uses the same features to segment the shape in 3D.

citing papers explorer

Showing 50 of 588 citing papers.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild · cs.CV · 2026-05-14 · conditional · none · ref 35 · internal anchor
A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation · cs.GR · 2026-05-13 · unverdicted · none · ref 7 · internal anchor
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models · cs.CR · 2026-05-10 · conditional · none · ref 26 · internal anchor
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing · cs.CV · 2026-04-17 · unverdicted · none · ref 39 · internal anchor
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation · cs.CV · 2026-04-13 · unverdicted · none · ref 20 · internal anchor
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models · cs.CV · 2024-09-25 · accept · none · ref 91 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark · cs.CL · 2024-09-04 · accept · none · ref 43 · internal anchor
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution · cs.CL · 2023-09-28 · unverdicted · none · ref 249 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy · astro-ph.IM · 2026-06-16 · unverdicted · none · ref 17 · internal anchor
A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.
Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing · cs.CV · 2026-05-31 · unverdicted · none · ref 33 · internal anchor
Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.
How Neural Losses Shape VAE Latents · cs.LG · 2026-05-30 · unverdicted · none · ref 29 · internal anchor
Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.
YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models · cs.CV · 2026-05-29 · unverdicted · none · ref 34 · internal anchor
YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.
Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence · cs.CV · 2026-05-28 · unverdicted · none · ref 28 · internal anchor
A 3D-aware framework uses SAM3D geometry and pose estimation plus geodesic filtering to supervise a lightweight adapter on DINO and Stable Diffusion features, improving semantic correspondence with less manual supervision.
FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views · cs.CV · 2026-05-28 · unverdicted · none · ref 20 · internal anchor
FRUC enables one-shot calibration-free dynamic scene reconstruction from collaborative driving views via a geometric Transformer, ego-centric occlusion priors, and zero-initialized residual denoising, claiming SOTA quality and speed on V2XReal and UrbanIng-V2X.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models · cs.CV · 2026-05-22 · unverdicted · none · ref 38 · internal anchor
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos · cs.CV · 2026-05-21 · unverdicted · none · ref 43 · internal anchor
NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
EventGait: Towards Robust Gait Recognition with Event Streams · cs.CV · 2026-05-21 · unverdicted · none · ref 60 · internal anchor
EventGait is a dual-stream spiking and cross-modal framework for event-based gait recognition that matches or exceeds RGB methods in normal conditions and significantly outperforms them in low light, supported by new synthetic event gait benchmarks.
Seeing Through Fog: Towards Fog-Invariant Action Recognition · cs.CV · 2026-05-20 · unverdicted · none · ref 27 · internal anchor
Introduces FogAct paired clean-foggy video dataset and FogNet two-stream CLIP model that learns fog-invariant semantic representations via clean-video guidance.
Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning · cs.CV · 2026-05-19 · unverdicted · none · ref 16 · internal anchor
Proposes weighted aggregation of clusters and self-distillation-driven token pruning to improve both accuracy and efficiency in ViT-based visual place recognition.
Capability ≠ Interpretability: Human Interpretability of Vision Foundation Models · cs.CV · 2026-05-19 · conditional · none · ref 4 · internal anchor
Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.
PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation · cs.CV · 2026-05-19 · unverdicted · none · ref 56 · internal anchor
PrAda adapts text-prompted segmentation models in a few-shot setting by learning and fusing class-specific prototypes from fine-grained and high-level features, yielding significant gains on semantic, instance, and panoptic segmentation across five benchmarks.
deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection · cs.CV · 2026-05-19 · accept · none · ref 25 · internal anchor
Releases DTE-aerial-train (385K patches) and DTE-aerial-bench (25 global orthoimages) as the first harmonized multi-resolution datasets for joint tree cover and mortality segmentation across biomes.
CineMatte: Background Matting for Virtual Production and Beyond · cs.CV · 2026-05-18 · unverdicted · none · ref 31 · internal anchor
CineMatte uses a cross-attention design on a Siamese DINOv3 ViT plus a pretrained upsampler to produce robust mattes for virtual production, backed by a new non-synthetic 4K VP dataset that supports camera motion.
Best Segmentation Buddies for Image-Shape Correspondence · cs.CV · 2026-05-18 · unverdicted · none · ref 43 · internal anchor
The work defines Best Segmentation Buddies as vertices on a 3D shape whose nearest image pixel under distilled features falls inside a given 2D segment, then uses the same features to segment the shape in 3D.
PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment · cs.LG · 2026-05-17 · unverdicted · none · ref 51 · internal anchor
PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation · cs.RO · 2026-05-17 · unverdicted · none · ref 20 · internal anchor
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
Seeking the Unfamiliar but Memorable: Conceptual Creativity as Meta-Learning · cs.LG · 2026-05-15 · unverdicted · none · ref 31 · internal anchor
Creativity is defined as meta-learning where a frozen diffusion creator optimizes candidates for rapid improvement by an adapting appraiser such as an autoencoder or CLIP adapter.
SurF: A Generative Model for Multivariate Irregular Time Series Forecasting · cs.LG · 2026-05-13 · unverdicted · none · ref 13 · internal anchor
SurF applies the Time Rescaling Theorem as a learnable bijection to create a single generative model for forecasting irregular multivariate event streams that outperforms or matches baselines on six benchmarks.
RotVLA: Rotational Latent Action for Vision-Language-Action Model · cs.RO · 2026-05-13 · unverdicted · none · ref 46 · internal anchor
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
Does Engram Do Memory Retrieval in Autoregressive Image Generation? · cs.CV · 2026-05-13 · accept · none · ref 7 · internal anchor
Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.
SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning · cs.LG · 2026-05-13 · unverdicted · none · ref 43 · internal anchor
SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude fewer than standard approaches.
Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic · cs.LG · 2026-05-12 · unverdicted · none · ref 133 · 2 links · internal anchor
Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predicate evaluation.
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives · cs.CV · 2026-05-12 · unverdicted · none · ref 32 · internal anchor
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models · cs.CV · 2026-05-12 · unverdicted · none · ref 45 · internal anchor
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.
DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies · cs.RO · 2026-05-12 · unverdicted · none · ref 36 · internal anchor
DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting · cs.CV · 2026-05-12 · unverdicted · none · ref 22 · internal anchor
PointGS achieves semantic-consistent unsupervised 3D point cloud segmentation by using 3D Gaussian Splatting to bridge discrete points and continuous 2D images for distilling SAM semantics.
STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models · cs.CV · 2026-05-12 · unverdicted · none · ref 29 · internal anchor
STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation · cs.RO · 2026-05-12 · conditional · none · ref 21 · internal anchor
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization · cs.CV · 2026-05-11 · unverdicted · none · ref 17 · 2 links · internal anchor
DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models · cs.RO · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation · cs.CV · 2026-05-09 · conditional · none · ref 9 · internal anchor
Raw CSD cosine similarity produces negative discrimination gaps for many artists and does not support absolute style-fidelity interpretation, but CSLS readout on frozen backbones reduces failures and improves AUC.
PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers · cs.CV · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
MPD²-Router: Mask-aware Multi-expert Prior-regularized Dual-head Deferral Router in Glaucoma Screening and Diagnosis · cs.AI · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
MPD²-Router is a dual-head deferral router that uses mask-aware Gumbel-sigmoid gating, asymmetric cost-sensitive training, and rank-majorization regularization to lower clinical cost and raise MCC versus AI-only baselines while balancing expert utilization across three glaucoma cohorts.
Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks · eess.IV · 2026-05-08 · unverdicted · none · ref 56 · internal anchor
Self-supervised monocular depth estimation improves in low-texture regions by using distance transforms on jointly estimated pre-semantic contours to create more informative loss signals.
What Cohort INRs Encode and Where to Freeze Them · cs.LG · 2026-05-08 · unverdicted · none · ref 45 · internal anchor
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs · cs.CV · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.
SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis · cs.CV · 2026-05-08 · unverdicted · none · ref 72 · 2 links · internal anchor
SplatWeaver uses cardinality Gaussian experts and pixel-level routing to dynamically allocate varying numbers of Gaussian primitives for generalizable novel view synthesis.
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling · cs.CV · 2026-05-08 · unverdicted · none · ref 25 · internal anchor
LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting · cs.CV · 2026-05-08 · unverdicted · none · ref 37 · 2 links · internal anchor
GS-DIFF detects changes in 3D Gaussian Splatting scenes by direct primitive attribute comparison with anisotropic drift models and observability terms, outperforming render-then-compare baselines by ~17% mIoU.
Learning Visual Feature-Based World Models via Residual Latent Action · cs.CV · 2026-05-08 · unverdicted · none · ref 38 · internal anchor
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
