super hub Mixed citations

DINOv2: Learning Robust Visual Features without Supervision

Huy Vo, Marc Szafraniec, Maxime Oquab, Vasil Khalidov · 2023 · cs.CV · arXiv 2304.07193

Mixed citation behavior. Most common role is background (44%).

587 Pith papers citing it

Background 44% of classified citations

open full Pith review browse 587 citing papers more from Huy Vo arXiv PDF

abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 59 background 57 baseline 9 dataset 3 other 1

citation-polarity summary

background 57 use method 57 baseline 9 unclear 4 use dataset 2

claims ledger

abstract The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques

authors

Huy Vo Marc Szafraniec Maxime Oquab Th\'eo Moutakanni Timoth\'ee Darcet Vasil Khalidov

co-cited works

representative citing papers

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy

astro-ph.IM · 2026-06-16 · unverdicted · novelty 7.0

A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

How Neural Losses Shape VAE Latents

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

A 3D-aware framework uses SAM3D geometry and pose estimation plus geodesic filtering to supervise a lightweight adapter on DINO and Stable Diffusion features, improving semantic correspondence with less manual supervision.

FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

FRUC enables one-shot calibration-free dynamic scene reconstruction from collaborative driving views via a geometric Transformer, ego-centric occlusion priors, and zero-initialized residual denoising, claiming SOTA quality and speed on V2XReal and UrbanIng-V2X.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.

EventGait: Towards Robust Gait Recognition with Event Streams

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

EventGait is a dual-stream spiking and cross-modal framework for event-based gait recognition that matches or exceeds RGB methods in normal conditions and significantly outperforms them in low light, supported by new synthetic event gait benchmarks.

Seeing Through Fog: Towards Fog-Invariant Action Recognition

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Introduces FogAct paired clean-foggy video dataset and FogNet two-stream CLIP model that learns fog-invariant semantic representations via clean-video guidance.

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes weighted aggregation of clusters and self-distillation-driven token pruning to improve both accuracy and efficiency in ViT-based visual place recognition.

Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

cs.CV · 2026-05-19 · conditional · novelty 7.0

Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

PrAda adapts text-prompted segmentation models in a few-shot setting by learning and fusing class-specific prototypes from fine-grained and high-level features, yielding significant gains on semantic, instance, and panoptic segmentation across five benchmarks.

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

cs.CV · 2026-05-19 · accept · novelty 7.0

Releases DTE-aerial-train (385K patches) and DTE-aerial-bench (25 global orthoimages) as the first harmonized multi-resolution datasets for joint tree cover and mortality segmentation across biomes.

CineMatte: Background Matting for Virtual Production and Beyond

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

CineMatte uses a cross-attention design on a Siamese DINOv3 ViT plus a pretrained upsampler to produce robust mattes for virtual production, backed by a new non-synthetic 4K VP dataset that supports camera motion.

Best Segmentation Buddies for Image-Shape Correspondence

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

The work defines Best Segmentation Buddies as vertices on a 3D shape whose nearest image pixel under distilled features falls inside a given 2D segment, then uses the same features to segment the shape in 3D.

citing papers explorer

Showing 50 of 59 citing papers after filters.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models cs.CR · 2026-05-10 · conditional · none · ref 26 · internal anchor
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
RotVLA: Rotational Latent Action for Vision-Language-Action Model cs.RO · 2026-05-13 · unverdicted · none · ref 46 · internal anchor
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives cs.CV · 2026-05-12 · unverdicted · none · ref 32 · internal anchor
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models cs.CV · 2026-05-12 · unverdicted · none · ref 45 · internal anchor
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.
PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting cs.CV · 2026-05-12 · unverdicted · none · ref 22 · internal anchor
PointGS achieves semantic-consistent unsupervised 3D point cloud segmentation by using 3D Gaussian Splatting to bridge discrete points and continuous 2D images for distilling SAM semantics.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation cs.RO · 2026-05-12 · conditional · none · ref 21 · internal anchor
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization cs.CV · 2026-05-11 · unverdicted · none · ref 17 · 2 links · internal anchor
DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers cs.CV · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis cs.CV · 2026-05-08 · unverdicted · none · ref 72 · 2 links · internal anchor
SplatWeaver uses cardinality Gaussian experts and pixel-level routing to dynamically allocate varying numbers of Gaussian primitives for generalizable novel view synthesis.
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling cs.CV · 2026-05-08 · unverdicted · none · ref 25 · internal anchor
LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 59 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures cs.CV · 2026-05-05 · unverdicted · none · ref 55 · internal anchor
HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation cs.CV · 2026-04-26 · unverdicted · none · ref 29 · 2 links · internal anchor
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 50 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
WildDet3D: Scaling Promptable 3D Detection in the Wild cs.CV · 2026-04-09 · unverdicted · none · ref 39 · internal anchor
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details cs.CV · 2026-04-08 · unverdicted · none · ref 32 · internal anchor
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
VOSR: A Vision-Only Generative Model for Image Super-Resolution cs.CV · 2026-04-03 · conditional · none · ref 25 · internal anchor
VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.
SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation cs.CV · 2026-04-03 · unverdicted · none · ref 27 · internal anchor
SD-FSMIS adapts Stable Diffusion for few-shot medical image segmentation via support-query interaction and visual-to-textual translation, yielding competitive performance and strong cross-domain generalization.
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation cs.RO · 2024-09-03 · conditional · none · ref 5 · internal anchor
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 44 · internal anchor
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction cs.CV · 2026-05-14 · unverdicted · none · ref 72 · internal anchor
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
From Image Hashing to Scene Change Detection cs.CV · 2026-05-12 · unverdicted · none · ref 23 · internal anchor
HashSCD is a patch-wise hashing method for unsupervised scene change detection and localization that operates directly in Hamming space with competitive performance and lower computational cost.
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 39 · internal anchor
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
DiffATS: Diffusion in Aligned Tensor Space cs.LG · 2026-05-10 · unverdicted · none · ref 40 · internal anchor
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.
Rethinking Dense Optical Flow without Test-Time Scaling cs.CV · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
Dense optical flow can be estimated accurately in one forward pass by combining DINO-v2 semantic priors and monocular depth geometric cues via global matching, reaching 2.81 EPE on Sintel Final without any refinement.
Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing cs.CV · 2026-05-08 · conditional · none · ref 26 · internal anchor
Semantic pseudo-pairing via DINOv2 embeddings and fused Gromov-Wasserstein optimal transport enables training a 7K-parameter CNN for unpaired smartphone ISP, achieving 22.569 PSNR on the NTIRE 2026 challenge test set.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 55 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Stylistic Attribute Control in Latent Diffusion Models cs.CV · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
MeshReGen: A Unified 3D Geometry Regeneration Framework cs.CV · 2026-04-30 · unverdicted · none · ref 43 · 2 links · internal anchor
MeshReGen introduces a conditioned 3D geometry regenerator with VecSet that learns a regeneration prior via self-supervision and reports state-of-the-art results on controllable generation tasks.
Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models cs.CV · 2026-04-29 · unverdicted · none · ref 13 · internal anchor
Three frameworks adapt foundation models for generalized category discovery under domain shifts via disentanglement and prompt tuning, showing gains on synthetic and real multi-domain data.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling cs.RO · 2026-04-21 · unverdicted · none · ref 35 · internal anchor
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 215 · internal anchor
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
Pi-HOC: Pairwise 3D Human-Object Contact Estimation cs.CV · 2026-04-14 · unverdicted · none · ref 23 · 2 links · internal anchor
Pi-HOC predicts dense 3D semantic contacts for all human-object pairs in an image via instance-aware tokens and an InteractionFormer, achieving higher accuracy and 20x throughput than prior methods.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding cs.LG · 2026-04-09 · unverdicted · none · ref 79 · internal anchor
A meta-optimized in-context learning approach enables training-free cross-subject semantic visual decoding from fMRI by inferring individual neural encoding patterns via hierarchical inference on a few examples.
Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks cs.CV · 2026-04-05 · conditional · none · ref 20 · internal anchor
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.
Gram-MMD: A Texture-Aware Metric for Image Realism Assessment cs.CV · 2026-04-03 · unverdicted · none · ref 6 · internal anchor
Gram-MMD is a texture-aware realism metric that computes MMD on upper-triangular Gram matrices from backbone activations, providing complementary information to semantic distributional metrics.
Depth Anything 3: Recovering the Visual Space from Any Views cs.CV · 2025-11-13 · unverdicted · none · ref 61 · internal anchor
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 65 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
ImgEdit: A Unified Image Editing Dataset and Benchmark cs.CV · 2025-05-26 · conditional · none · ref 52 · internal anchor
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model cs.CV · 2025-03-13 · unverdicted · none · ref 72 · internal anchor
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
OpenVLA: An Open-Source Vision-Language-Action Model cs.RO · 2024-06-13 · unverdicted · none · ref 26 · internal anchor
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
What Limits Vision-and-Language Navigation ? cs.RO · 2026-05-13 · unverdicted · none · ref 53 · internal anchor
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness cs.CV · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multimodal baselines on UK Biobank.
TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies cs.RO · 2026-05-02 · unverdicted · none · ref 20 · 2 links · internal anchor
TAIL-Safe learns a Lipschitz Q-function from visibility, recognizability, and graspability criteria in a Gaussian Splatting twin to define an empirical safe set for IL policies and recovers unsafe actions via Nagumo-inspired gradient ascent.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System cs.CV · 2026-04-15 · unverdicted · none · ref 27 · 2 links · internal anchor
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding cs.LG · 2026-04-14 · unverdicted · none · ref 48 · internal anchor
Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views cs.CV · 2026-04-10 · unverdicted · none · ref 27 · internal anchor
FF3R unifies geometric and semantic 3D reconstruction in a single annotation-free feed-forward network trained solely via RGB and feature rendering supervision.
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details cs.CV · 2025-07-03 · unverdicted · none · ref 43 · internal anchor
MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation cs.CV · 2025-06-03 · unverdicted · none · ref 31 · internal anchor
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

DINOv2: Learning Robust Visual Features without Supervision

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer