DINOv2: Learning Robust Visual Features without Supervision

Mixed citation behavior. Most common role is background (44%).

752 Pith papers citing it
Background 44% of classified citations
abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

citation-role summary

method 59 · background 57 · baseline 9 · dataset 3 · other 1

representative citing papers

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

Learning 1-Bit LiDAR-based Localization with Auxiliary Objective

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.

Scene and Human in One World: Reconstruction in a Feedforward Pass

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

SHOW is a mask-promptable framework coupling feed-forward scene reconstruction with human mesh recovery in a unified metric space to resolve scale ambiguity and improve human-scene alignment from monocular video.

citing papers explorer

Showing 8 of 8 citing papers after filters.

  • Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation · cs.GR · 2026-05-13 · unverdicted · none · ref 7 · internal anchor

    Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

  • Garment Particles: A 2D–3D Symmetric Garment Representation for Generation and Editing · cs.GR · 2026-05-25 · unverdicted · none · ref 24 · internal anchor

    Garment Particles is a 5D point cloud representation jointly encoding 2D sewing patterns and 3D geometry, supporting rectified flow generation from high-level inputs and diffusion-based editing of patterns or shapes.

  • Snapshot Polarimetric Display Inverse Rendering · cs.GR · 2026-05-24 · unverdicted · none · ref 48 · internal anchor

    A feed-forward transformer estimates per-pixel normal, albedo, roughness, and metallicity from single-shot spectro-polarimetric measurements captured with a polarimetric display and augmented RGB polarization camera, using a generative manifold to expand limited BRDF training data.

  • DualBrep: A Dual-Field Continuous Representation for B-rep Modelling · cs.GR · 2026-06-30 · unverdicted · none · ref 7 · internal anchor

    DualBrep encodes B-rep models as dual scalar fields (SDF geometry + UDF topology) compressed into a shared latent space for flow-matching generation and neural B-rep extraction.

  • Generative 3D Gaussians with Learned Density Control · cs.GR · 2026-05-08 · unverdicted · none · ref 33 · internal anchor

    DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.

  • VVGT: Visual Volume-Grounded Transformer · cs.GR · 2026-04-14 · unverdicted · none · ref 19 · internal anchor

    VVGT is a dual-transformer network with Volume Geometry Forcing that maps volumetric data to 3D Gaussian primitives for accurate ray-based rendering without per-scene optimization.

  • MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning · cs.GR · 2025-11-22 · unverdicted · none · ref 7 · internal anchor

    MotionDuet generates realistic controllable 3D human motions via dual text-video conditioning with DUET unified encoding and DASH distribution-aware loss.

  • AssetGen: Deployable 3D Asset Generation at Interactive Speed · cs.GR · 2026-05-22 · unverdicted · none · ref 15 · internal anchor

    AssetGen is a system that produces deployable 3D assets including meshes, baked normals, and textures from a single reference image in under 30 seconds via a coarse-to-refine VecSet pipeline and co-designed optimizations.