super hub Mixed citations

DINOv2: Learning Robust Visual Features without Supervision

Huy Vo, Marc Szafraniec, Maxime Oquab, Vasil Khalidov · 2023 · cs.CV · arXiv 2304.07193

Mixed citation behavior. The most common role is background (44% of classified citations).

676 Pith papers cite it.

abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

citation-role summary

method 59 · background 57 · baseline 9 · dataset 3 · other 1

citation-polarity summary

background 57 · use method 57 · baseline 9 · unclear 4 · use dataset 2
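
The 44% background share quoted at the top of this page can be reproduced from the polarity counts above. A minimal sketch, assuming each percentage is a label's count divided by the sum of all listed counts (the page does not state its denominator); `polarity_counts` and `shares` are illustrative names, not part of any Pith tooling.

```python
# Counts copied from the citation-polarity summary on this page.
polarity_counts = {
    "background": 57,
    "use method": 57,
    "baseline": 9,
    "unclear": 4,
    "use dataset": 2,
}


def shares(counts):
    """Return each label's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


# Assumption: "classified citations" means the sum of the counts listed above.
total_classified = sum(polarity_counts.values())
polarity_shares = shares(polarity_counts)

print(total_classified)
print(polarity_shares)
```

Under that assumption the listed counts sum to 129 classified citations, and the 57 background citations round to 44%.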

authors

Huy Vo, Marc Szafraniec, Maxime Oquab, Théo Moutakanni, Timothée Darcet, Vasil Khalidov

representative citing papers

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · novelty 8.0

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

Language-Assisted Super-Resolution from Real-World Low-Resolution Patches

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

LA-SR redefines unpaired super-resolution in language space by projecting images into a semantically rich representation and applying vision-language model guided losses to handle real-world degradations extracted from depth variations.

WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.

Complete virtual unwrapping and reading of a rolled Herculaneum papyrus

eess.IV · 2026-06-27 · unverdicted · novelty 7.0

First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.

A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$

cs.CV · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.

Learning 1-Bit LiDAR-based Localization with Auxiliary Objective

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.

Scene and Human in One World: Reconstruction in a Feedforward Pass

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

SHOW is a mask-promptable framework coupling feed-forward scene reconstruction with human mesh recovery in a unified metric space to resolve scale ambiguity and improve human-scene alignment from monocular video.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

MemoBench is a new diagnostic benchmark with 360 synthetic and real clips plus VQA evaluation that tests memory consistency in video models under the disappear-and-reappear paradigm in dynamically changing environments.

MIRAGE: Protecting against Malicious Image Editing via False Moderation

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.

Rethinking Prototype-based Similarity Learning for Few-Shot Object Detection

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Introduces TSMa using text-visual channel interaction and SHARe using ViT layer-aligned autoregressive regression to improve prototype-based few-shot object detection, reporting +10.1 nAP on COCO.

Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy

astro-ph.IM · 2026-06-16 · unverdicted · novelty 7.0

A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.

Pano3D: Unified 3D Reconstruction and Panoptic Segmentation

cs.CV · 2026-06-12 · unverdicted · novelty 7.0

Pano3D augments 3D feedforward reconstruction backbones with a set-based mask decoder and joint geometric-semantic training to achieve SOTA 3D panoptic segmentation on ScanNet, ScanNet200, and ScanNet++.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

How Neural Losses Shape VAE Latents

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.

citing papers explorer

Showing 50 of 504 citing papers after filters.

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · none · ref 36 · internal anchor

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · none · ref 35 · internal anchor

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · none · ref 39 · internal anchor

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · none · ref 20 · internal anchor

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · none · ref 91 · internal anchor

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · none · ref 29 · internal anchor

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

Language-Assisted Super-Resolution from Real-World Low-Resolution Patches

cs.CV · 2026-06-30 · unverdicted · none · ref 115 · internal anchor

LA-SR redefines unpaired super-resolution in language space by projecting images into a semantically rich representation and applying vision-language model guided losses to handle real-world degradations extracted from depth variations.

WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

cs.CV · 2026-06-30 · unverdicted · none · ref 35 · internal anchor

WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

cs.CV · 2026-06-29 · unverdicted · none · ref 27 · internal anchor

A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.

A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$

cs.CV · 2026-06-26 · unverdicted · none · ref 30 · 2 links · internal anchor

Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.

Learning 1-Bit LiDAR-based Localization with Auxiliary Objective

cs.CV · 2026-06-26 · unverdicted · none · ref 44 · internal anchor

BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.

Scene and Human in One World: Reconstruction in a Feedforward Pass

cs.CV · 2026-06-26 · unverdicted · none · ref 46 · internal anchor

SHOW is a mask-promptable framework coupling feed-forward scene reconstruction with human mesh recovery in a unified metric space to resolve scale ambiguity and improve human-scene alignment from monocular video.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

cs.CV · 2026-06-25 · unverdicted · none · ref 50 · internal anchor

MemoBench is a new diagnostic benchmark with 360 synthetic and real clips plus VQA evaluation that tests memory consistency in video models under the disappear-and-reappear paradigm in dynamically changing environments.

Rethinking Prototype-based Similarity Learning for Few-Shot Object Detection

cs.CV · 2026-06-22 · unverdicted · none · ref 28 · internal anchor

Introduces TSMa using text-visual channel interaction and SHARe using ViT layer-aligned autoregressive regression to improve prototype-based few-shot object detection, reporting +10.1 nAP on COCO.

Pano3D: Unified 3D Reconstruction and Panoptic Segmentation

cs.CV · 2026-06-12 · unverdicted · none · ref 31 · internal anchor

Pano3D augments 3D feedforward reconstruction backbones with a set-based mask decoder and joint geometric-semantic training to achieve SOTA 3D panoptic segmentation on ScanNet, ScanNet200, and ScanNet++.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · none · ref 33 · internal anchor

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · none · ref 34 · internal anchor

YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

cs.CV · 2026-05-28 · unverdicted · none · ref 28 · internal anchor

A 3D-aware framework uses SAM3D geometry and pose estimation plus geodesic filtering to supervise a lightweight adapter on DINO and Stable Diffusion features, improving semantic correspondence with less manual supervision.

FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views

cs.CV · 2026-05-28 · unverdicted · none · ref 20 · internal anchor

FRUC enables one-shot calibration-free dynamic scene reconstruction from collaborative driving views via a geometric Transformer, ego-centric occlusion priors, and zero-initialized residual denoising, claiming SOTA quality and speed on V2XReal and UrbanIng-V2X.

SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

cs.CV · 2026-05-27 · unverdicted · none · ref 30 · internal anchor

SeeGroup formulates per-pixel multi-layer depth as a point process with permutation-invariant likelihood to support arbitrary groupings, raising quadruplet relative depth accuracy from 61.34% to 70.09% on the LayeredDepth benchmark.

OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

cs.CV · 2026-05-26 · unverdicted · none · ref 31 · internal anchor

OmniGF adapts VLMs via dual-branch decoding and head embeddings to unify precise multi-person gaze localization with semantic and social reasoning, claiming new SOTA on benchmarks.

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

cs.CV · 2026-05-25 · unverdicted · none · ref 17 · internal anchor

LongAV-Compass is a new benchmark and evaluation framework for minute-scale audio-visual generation across T2AV, I2AV, and V2AV with multi-dimensional assessment.

EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding

cs.CV · 2026-05-25 · unverdicted · none · ref 37 · internal anchor

EVIDENT routes MLLM adaptation for video temporal grounding through entity-grounded visual evidence using an Entity Bottleneck Adapter, Entity-Binding Distillation, and Entity-to-eVidence gating to improve cross-domain robustness.

Paris 2.0: A Decentralized Diffusion Model for Video Generation

cs.CV · 2026-05-25 · unverdicted · none · ref 7 · internal anchor

Paris 2.0 is the first decentralized diffusion model for text-to-video generation and reports roughly 2x lower FVD than a monolithic baseline under matched total compute.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

cs.CV · 2026-05-25 · unverdicted · none · ref 73 · internal anchor

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

cs.CV · 2026-05-25 · unverdicted · none · ref 34 · internal anchor

GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.

DeltaCam: Differential Intrinsic Camera Modeling for Video Generation

cs.CV · 2026-05-24 · unverdicted · none · ref 23 · internal anchor

DeltaCam models relative changes in camera intrinsics via Δ-parameterized neural adaptors in video diffusion models trained on synthetic data to enable controllable generation and real-world transfer.

SliceWorld: A Predictive and Controllable World-State Model for CT Report Generation

cs.CV · 2026-05-23 · unverdicted · none · ref 19 · internal anchor

SliceWorld introduces a world-state model for CT report generation that uses predictive and factor-aware objectives on axial slice sequences.

ArtSplat: Feed-Forward Articulated 3D Gaussian Splatting from Sparse Multi-State Uncalibrated Views

cs.CV · 2026-05-23 · unverdicted · none · ref 22 · internal anchor

ArtSplat is the first feed-forward framework for articulated 3D Gaussian Splatting that reconstructs geometry and joints from sparse multi-state uncalibrated views in one pass.

Loki: Representation over Architecture for Diffusion-Based Portrait Animation

cs.CV · 2026-05-22 · unverdicted · none · ref 31 · internal anchor

Loki replaces RGB conditioning stacks with identity-orthogonal parametric face encodings rasterized for diffusion, achieving efficient cross-ID portrait animation without cross-ID training data.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · none · ref 38 · internal anchor

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

cs.CV · 2026-05-21 · unverdicted · none · ref 43 · internal anchor

NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.

EventGait: Towards Robust Gait Recognition with Event Streams

cs.CV · 2026-05-21 · unverdicted · none · ref 60 · internal anchor

EventGait is a dual-stream spiking and cross-modal framework for event-based gait recognition that matches or exceeds RGB methods in normal conditions and significantly outperforms them in low light, supported by new synthetic event gait benchmarks.

Seeing Through Fog: Towards Fog-Invariant Action Recognition

cs.CV · 2026-05-20 · unverdicted · none · ref 27 · internal anchor

Introduces FogAct paired clean-foggy video dataset and FogNet two-stream CLIP model that learns fog-invariant semantic representations via clean-video guidance.

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

cs.CV · 2026-05-19 · unverdicted · none · ref 16 · internal anchor

Proposes weighted aggregation of clusters and self-distillation-driven token pruning to improve both accuracy and efficiency in ViT-based visual place recognition.

Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

cs.CV · 2026-05-19 · conditional · none · ref 4 · internal anchor

Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

cs.CV · 2026-05-19 · unverdicted · none · ref 43 · 2 links · internal anchor

MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation featuring four dimensions, challenging scenarios, and an adaptive hybrid evaluation framework that achieves 91.5% Spearman correlation with human judgments.

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

cs.CV · 2026-05-19 · unverdicted · none · ref 56 · internal anchor

PrAda adapts text-prompted segmentation models in a few-shot setting by learning and fusing class-specific prototypes from fine-grained and high-level features, yielding significant gains on semantic, instance, and panoptic segmentation across five benchmarks.

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

cs.CV · 2026-05-19 · accept · none · ref 25 · internal anchor

Releases DTE-aerial-train (385K patches) and DTE-aerial-bench (25 global orthoimages) as the first harmonized multi-resolution datasets for joint tree cover and mortality segmentation across biomes.

CineMatte: Background Matting for Virtual Production and Beyond

cs.CV · 2026-05-18 · unverdicted · none · ref 31 · internal anchor

CineMatte uses a cross-attention design on a Siamese DINOv3 ViT plus a pretrained upsampler to produce robust mattes for virtual production, backed by a new non-synthetic 4K VP dataset that supports camera motion.

Best Segmentation Buddies for Image-Shape Correspondence

cs.CV · 2026-05-18 · unverdicted · none · ref 43 · internal anchor

The work defines Best Segmentation Buddies as vertices on a 3D shape whose nearest image pixel under distilled features falls inside a given 2D segment, then uses the same features to segment the shape in 3D.

The Velocity Deficit: Initial Energy Injection for Flow Matching

cs.CV · 2026-05-14 · unverdicted · none · ref 3 · internal anchor

Flow matching underestimates velocities due to MSE loss leading to integration lag; Initial Energy Injection corrects the start-end asymmetry, improving FID by 44.6% and achieving 5x speedup on ImageNet-1k.

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

cs.CV · 2026-05-13 · accept · none · ref 7 · internal anchor

Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

cs.CV · 2026-05-12 · unverdicted · none · ref 32 · internal anchor

CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

cs.CV · 2026-05-12 · unverdicted · none · ref 45 · internal anchor

Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.

PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting

cs.CV · 2026-05-12 · unverdicted · none · ref 22 · internal anchor

PointGS achieves semantic-consistent unsupervised 3D point cloud segmentation by using 3D Gaussian Splatting to bridge discrete points and continuous 2D images for distilling SAM semantics.

STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

cs.CV · 2026-05-12 · unverdicted · none · ref 29 · internal anchor

STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

cs.CV · 2026-05-11 · unverdicted · none · ref 17 · 2 links · internal anchor

DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.

When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation

cs.CV · 2026-05-09 · conditional · none · ref 9 · internal anchor

Raw CSD cosine similarity produces negative discrimination gaps for many artists and does not support absolute style-fidelity interpretation, but CSLS readout on frozen backbones reduces failures and improves AUC.

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

cs.CV · 2026-05-08 · unverdicted · none · ref 15 · internal anchor

PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
