super hub Mixed citations

DINOv2: Learning Robust Visual Features without Supervision

Huy Vo, Marc Szafraniec, Maxime Oquab, Vasil Khalidov · 2023 · cs.CV · arXiv 2304.07193

Mixed citation behavior. Most common role is background (44%).

587 Pith papers citing it

Background 44% of classified citations

open full Pith review browse 587 citing papers more from Huy Vo arXiv PDF

abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 59 background 57 baseline 9 dataset 3 other 1

citation-polarity summary

background 57 use method 57 baseline 9 unclear 4 use dataset 2

claims ledger

abstract The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques

authors

Huy Vo Marc Szafraniec Maxime Oquab Th\'eo Moutakanni Timoth\'ee Darcet Vasil Khalidov

co-cited works

representative citing papers

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy

astro-ph.IM · 2026-06-16 · unverdicted · novelty 7.0

A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

How Neural Losses Shape VAE Latents

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

A 3D-aware framework uses SAM3D geometry and pose estimation plus geodesic filtering to supervise a lightweight adapter on DINO and Stable Diffusion features, improving semantic correspondence with less manual supervision.

FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

FRUC enables one-shot calibration-free dynamic scene reconstruction from collaborative driving views via a geometric Transformer, ego-centric occlusion priors, and zero-initialized residual denoising, claiming SOTA quality and speed on V2XReal and UrbanIng-V2X.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.

EventGait: Towards Robust Gait Recognition with Event Streams

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

EventGait is a dual-stream spiking and cross-modal framework for event-based gait recognition that matches or exceeds RGB methods in normal conditions and significantly outperforms them in low light, supported by new synthetic event gait benchmarks.

Seeing Through Fog: Towards Fog-Invariant Action Recognition

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Introduces FogAct paired clean-foggy video dataset and FogNet two-stream CLIP model that learns fog-invariant semantic representations via clean-video guidance.

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes weighted aggregation of clusters and self-distillation-driven token pruning to improve both accuracy and efficiency in ViT-based visual place recognition.

Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

cs.CV · 2026-05-19 · conditional · novelty 7.0

Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

PrAda adapts text-prompted segmentation models in a few-shot setting by learning and fusing class-specific prototypes from fine-grained and high-level features, yielding significant gains on semantic, instance, and panoptic segmentation across five benchmarks.

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

cs.CV · 2026-05-19 · accept · novelty 7.0

Releases DTE-aerial-train (385K patches) and DTE-aerial-bench (25 global orthoimages) as the first harmonized multi-resolution datasets for joint tree cover and mortality segmentation across biomes.

CineMatte: Background Matting for Virtual Production and Beyond

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

CineMatte uses a cross-attention design on a Siamese DINOv3 ViT plus a pretrained upsampler to produce robust mattes for virtual production, backed by a new non-synthetic 4K VP dataset that supports camera motion.

Best Segmentation Buddies for Image-Shape Correspondence

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

The work defines Best Segmentation Buddies as vertices on a 3D shape whose nearest image pixel under distilled features falls inside a given 2D segment, then uses the same features to segment the shape in 3D.

citing papers explorer

Showing 50 of 67 citing papers after filters.

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation cs.RO · 2026-05-17 · unverdicted · none · ref 20 · internal anchor
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
RotVLA: Rotational Latent Action for Vision-Language-Action Model cs.RO · 2026-05-13 · unverdicted · none · ref 46 · internal anchor
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies cs.RO · 2026-05-12 · unverdicted · none · ref 36 · internal anchor
DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation cs.RO · 2026-05-12 · conditional · none · ref 21 · internal anchor
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 59 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion cs.RO · 2026-05-02 · unverdicted · none · ref 8 · internal anchor
Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 88 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis cs.RO · 2026-04-23 · unverdicted · none · ref 10 · internal anchor
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation cs.RO · 2026-04-07 · unverdicted · none · ref 21 · internal anchor
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control cs.RO · 2026-03-18 · conditional · none · ref 20 · internal anchor
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness cs.RO · 2026-03-18 · unverdicted · none · ref 22 · internal anchor
HeiSD delivers up to 2.45x faster inference for embodied VLA models by hybridizing speculative decoding with kinematic boundary detection and error-mitigation tricks while preserving task success rates.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 71 · internal anchor
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation cs.RO · 2024-09-03 · conditional · none · ref 5 · internal anchor
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
UfM*: Uncertainty from Motion* for DNN Depth Estimation Using Gaussians cs.RO · 2026-05-21 · unverdicted · none · ref 74 · internal anchor
UfM* uses Gaussian mixtures to compute multiview disagreement for uncertainty in depth estimation with single inference per image, reducing energy and memory use.
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System cs.RO · 2026-05-18 · unverdicted · none · ref 42 · internal anchor
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
LACE: Latent Visual Representation for Cross-Embodiment Learning cs.RO · 2026-05-16 · unverdicted · none · ref 38 · internal anchor
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning cs.RO · 2026-05-12 · unverdicted · none · ref 64 · internal anchor
TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 39 · internal anchor
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 55 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation cs.RO · 2026-05-02 · unverdicted · none · ref 17 · internal anchor
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling cs.RO · 2026-04-21 · unverdicted · none · ref 35 · internal anchor
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation cs.RO · 2026-04-20 · unverdicted · none · ref 31 · internal anchor
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
One-Shot Cross-Geometry Skill Transfer through Part Decomposition cs.RO · 2026-04-16 · unverdicted · none · ref 17 · internal anchor
Part decomposition with generative shape models allows one-shot robot skill transfer across unfamiliar object geometries in simulation and real settings.
UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception cs.RO · 2026-04-15 · unverdicted · none · ref 21 · internal anchor
UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.
V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation cs.RO · 2026-04-10 · unverdicted · none · ref 17 · internal anchor
V-CAGE automates the creation of scalable, high-quality robotic manipulation datasets through context-aware scene construction, closed-loop visual verification, and perceptually-driven compression.
RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains cs.RO · 2026-04-06 · unverdicted · none · ref 28 · internal anchor
RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.
The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling cs.RO · 2026-04-03 · unverdicted · none · ref 6 · internal anchor
Discrete action tokenization in VLA models creates an information bottleneck that prevents vision encoder scaling from improving performance, unlike continuous policies, as validated on the LIBERO benchmark.
Emergent Neural Automaton Policies: Learning Symbolic Structure from Visuomotor Trajectories cs.RO · 2026-03-26 · unverdicted · none · ref 44 · internal anchor
ENAP extracts an emergent Mealy automaton from visuomotor trajectories to act as a high-level planner for a low-level residual policy, yielding up to 27% higher success than end-to-end VLA policies in low-data regimes.
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons cs.RO · 2026-03-02 · unverdicted · none · ref 132 · internal anchor
Robometer combines intra-trajectory progress supervision with inter-trajectory preference supervision on a 1M-trajectory dataset to learn more generalizable robotic reward functions than prior methods.
InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation cs.RO · 2026-02-26 · unverdicted · none · ref 24 · internal anchor
InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.
CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion cs.RO · 2026-01-14 · unverdicted · none · ref 44 · internal anchor
CLARE is an exemplar-free continual learning framework for VLAs that autonomously expands modular adapters based on feature similarity and uses autoencoder routing for label-free deployment.
IGen: Scalable Data Generation for Robot Learning from Open-World Images cs.RO · 2025-12-01 · unverdicted · none · ref 48 · internal anchor
IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.
Unify Robot Actions in Camera Frame cs.RO · 2025-11-21 · conditional · none · ref 29 · internal anchor
CalibAll estimates camera extrinsics on existing datasets to convert robot actions into a unified camera-frame representation, enabling stronger cross-embodiment pretraining.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy cs.RO · 2025-10-15 · unverdicted · none · ref 29 · internal anchor
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training cs.RO · 2025-09-29 · unverdicted · none · ref 13 · internal anchor
World-Env replaces physical robot interactions with a world model-based virtual environment and VLM-guided rewards to enable efficient RL post-training for VLA models, showing gains with only five demonstrations per task.
RAPTOR: A Foundation Policy for Quadrotor Control cs.RO · 2025-09-15 · unverdicted · none · ref 47 · internal anchor
A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation cs.RO · 2025-08-26 · conditional · none · ref 17 · internal anchor
MemoryVLA introduces a perceptual-cognitive memory bank and working-memory retrieval mechanism into VLA models, raising success rates on long-horizon robotic tasks by up to 26 points over prior baselines.
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation cs.RO · 2025-08-07 · unverdicted · none · ref 25 · internal anchor
Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning cs.RO · 2025-05-24 · conditional · none · ref 55 · internal anchor
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
Policy Contrastive Decoding for Robotic Foundation Models cs.RO · 2025-05-19 · conditional · none · ref 5 · internal anchor
PCD redirects robotic policies toward object-relevant visual features via contrastive decoding on masked inputs, improving generalization without retraining or weight access.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data cs.RO · 2025-05-06 · unverdicted · none · ref 71 · internal anchor
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success cs.RO · 2025-02-27 · accept · none · ref 35 · internal anchor
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies cs.RO · 2024-12-13 · conditional · none · ref 90 · internal anchor
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning cs.RO · 2024-11-07 · unverdicted · none · ref 42 · internal anchor
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
OpenVLA: An Open-Source Vision-Language-Action Model cs.RO · 2024-06-13 · unverdicted · none · ref 26 · internal anchor
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks cs.RO · 2026-06-17 · unverdicted · none · ref 19 · internal anchor
ReSiReg clusters VLM intermediates into prototypes, derives language descriptors, and reconstructs patches as mixtures to improve spatial consistency in dense language-grounded retrieval for robotics.
Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation cs.RO · 2026-05-20 · unverdicted · none · ref 12 · internal anchor
A hybrid structural latent points representation is learned by inserting a point-wise latent VAE into a point-cloud autoencoder and regularizing toward a Gaussian prior, paired with a lightweight 3DGS rendering pipeline, yielding gains on RLBench and ManiSkill2 benchmarks.
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization cs.RO · 2026-05-17 · unverdicted · none · ref 74 · internal anchor
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
What Limits Vision-and-Language Navigation ? cs.RO · 2026-05-13 · unverdicted · none · ref 53 · internal anchor
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.

DINOv2: Learning Robust Visual Features without Supervision

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer