Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
super hub Mixed citations
DINOv2: Learning Robust Visual Features without Supervision
Mixed citation behavior. Most common role is background (44%).
abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques
authors
co-cited works
representative citing papers
A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.
First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.
Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.
BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.
SHOW is a mask-promptable framework coupling feed-forward scene reconstruction with human mesh recovery in a unified metric space to resolve scale ambiguity and improve human-scene alignment from monocular video.
MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.
Introduces TSMa using text-visual channel interaction and SHARe using ViT layer-aligned autoregressive regression to improve prototype-based few-shot object detection, reporting +10.1 nAP on COCO.
A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.
Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.
Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.
YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.
A 3D-aware framework uses SAM3D geometry and pose estimation plus geodesic filtering to supervise a lightweight adapter on DINO and Stable Diffusion features, improving semantic correspondence with less manual supervision.
FRUC enables one-shot calibration-free dynamic scene reconstruction from collaborative driving views via a geometric Transformer, ego-centric occlusion priors, and zero-initialized residual denoising, claiming SOTA quality and speed on V2XReal and UrbanIng-V2X.
LoRA-Key creates a standalone user-specific Watermark LoRA trained with a latent watermark prior and GOP, attachable via training-free superposition to protect LoRA ownership while preserving quality.
SeeGroup formulates per-pixel multi-layer depth as a point process with permutation-invariant likelihood to support arbitrary groupings, raising quadruplet relative depth accuracy from 61.34% to 70.09% on the LayeredDepth benchmark.
citing papers explorer
-
RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
-
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
-
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
-
What Limits Vision-and-Language Navigation ?
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
-
TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies
TAIL-Safe learns a Lipschitz Q-function from visibility, recognizability, and graspability criteria in a Gaussian Splatting twin to define an empirical safe set for IL policies and recovers unsafe actions via Nagumo-inspired gradient ascent.
-
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
StereoPolicy fuses left-right image features via cross-attention to deliver consistent gains over RGB, RGB-D, point cloud, and multi-view baselines in simulation and real-robot manipulation tasks.