DINOv2: Learning Robust Visual Features without Supervision

Huy Vo, Marc Szafraniec, Maxime Oquab, Vasil Khalidov · 2023 · cs.CV · arXiv 2304.07193

Mixed citation behavior. Most common role is background (44%).

765 Pith papers citing it

Background 44% of classified citations

abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

citation-role summary

method 59 · background 57 · baseline 9 · dataset 3 · other 1
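
The background percentage quoted at the top of the page can be checked against the role counts listed above. The following is a minimal sketch, assuming those five counts are the complete set of classified citations; the names `role_counts` and `role_shares` are illustrative and not part of any Pith tool.

```python
# Role counts copied from the citation-role summary above.
# Assumption: these five roles cover every classified citation.
role_counts = {"method": 59, "background": 57, "baseline": 9, "dataset": 3, "other": 1}


def role_shares(counts):
    """Return each role's share of all classified citations, in percent."""
    total = sum(counts.values())
    return {role: 100.0 * n / total for role, n in counts.items()}


shares = role_shares(role_counts)

# Print roles from largest to smallest share.
for role, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{role:<10} {role_counts[role]:>3}  {pct:5.1f}%")
```

Under this assumption the counts total 129, and background works out to roughly 44%, matching the figure quoted in the header. The same counts put method slightly higher, so the header's "most common role" wording may rest on a different tally than this summary.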

citation-polarity summary

background 57 · use method 57 · baseline 9 · unclear 4 · use dataset 2

authors

Huy Vo · Marc Szafraniec · Maxime Oquab · Théo Moutakanni · Timothée Darcet · Vasil Khalidov

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

eess.IV · 2026-06-07 · unverdicted · novelty 8.0

X-Palm supplies the first paired multispectral-to-smartphone palmprint dataset with broad real-world variability to support cross-domain biometric authentication.

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · novelty 8.0

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

cs.CV · 2026-05-14 · conditional · novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Prototype Memory-Guided Training-Free Anomaly Classification and Localization in Prenatal Ultrasound

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

A training-free prototype memory-guided framework for multi-class prenatal ultrasound anomaly classification and localization using few reference images per class, validated on a 9-category multi-center dataset.

EPO: Boosting 3D Foundation Models with Edge-based Pose Optimization

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

EPO is a trackless, edge-map-alignment framework that refines pose estimates from 3D foundation models and matches or exceeds bundle-adjustment performance with substantially lower runtime and memory use.

GEAR: Guided End-to-End AutoRegression for Image Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across quantizers and to text-to-image.

WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

AnyMatch: Supercharging Universal Multi-Modal Image Matching with Large-Scale Single-View Images

cs.CV · 2026-06-30 · unverdicted · novelty 7.0 · 2 refs

AnyMatch synthesizes large-scale geometrically consistent multi-modal image pairs from single-view images, enabling fine-tuned matching networks to achieve substantial gains on benchmarks.

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

A new dataset of 220k+ cross-view pairs and a single-stage geometry-aware model GAGeo based on the π³ 3D foundation model outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.

Complete virtual unwrapping and reading of a rolled Herculaneum papyrus

eess.IV · 2026-06-27 · unverdicted · novelty 7.0

First complete digital unwrapping and reading of a Herculaneum papyrus scroll (PHerc. 1667) via synchrotron X-ray CT, virtual unrolling, and machine learning.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$

cs.CV · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal model, and reports preliminary accuracy gains on PatternNet in low-data regimes.

Learning 1-Bit LiDAR-based Localization with Auxiliary Objective

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.

Scene and Human in One World: Reconstruction in a Feedforward Pass

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

SHOW is a mask-promptable framework coupling feed-forward scene reconstruction with human mesh recovery in a unified metric space to resolve scale ambiguity and improve human-scene alignment from monocular video.

MIRAGE: Protecting against Malicious Image Editing via False Moderation

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.

Rethinking Prototype-based Similarity Learning for Few-Shot Object Detection

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Introduces TSMa using text-visual channel interaction and SHARe using ViT layer-aligned autoregressive regression to improve prototype-based few-shot object detection, reporting +10.1 nAP on COCO.

citing papers explorer

Showing 50 of 765 citing papers.

GenHSI: Controllable Generation of Human-Scene Interaction Videos cs.CV · 2025-06-24 · unverdicted · none · ref 66 · internal anchor
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models cs.CV · 2025-06-10 · unverdicted · none · ref 71 · internal anchor
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
A European Multi-Center Breast Cancer MRI Dataset eess.IV · 2025-05-31 · unverdicted · none · ref 13 · internal anchor
Releases a new public multi-center European breast MRI dataset of 741 cases with heterogeneous protocols and provides baseline transformer model benchmarks.
FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry cs.CV · 2025-05-20 · unverdicted · none · ref 8 · internal anchor
FractalMamba++ scales Vision Mamba across resolutions by using Hilbert fractal serialization, hierarchy-based skip connections, and fractal-aware 2D rotary position encoding.
VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold cs.CV · 2025-05-18 · unverdicted · none · ref 47 · internal anchor
VGGT-SLAM aligns VGGT submaps via SL(4) manifold optimization of 15-DoF homographies to enable consistent dense RGB SLAM on long uncalibrated monocular videos.
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer cs.CV · 2025-04-29 · unverdicted · none · ref 48 · internal anchor
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
Toward Generalizable Forgery Detection and Reasoning cs.CV · 2025-03-27 · unverdicted · none · ref 78 · internal anchor
FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.
Adaptive Camera Sensor for Vision Models cs.CV · 2025-03-04 · unverdicted · none · ref 13 · internal anchor
Lens adapts camera sensors in real time via the VisiT confidence-based quality indicator to improve vision model accuracy on domain-shifted images, shown on ImageNet-ES and a new diverse benchmark.
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation cs.RO · 2024-09-03 · conditional · none · ref 5 · internal anchor
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 44 · internal anchor
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
Massive Activations in Large Language Models cs.CL · 2024-02-27 · unverdicted · none · ref 140 · internal anchor
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
Project Aria: A New Tool for Egocentric Multi-Modal AI Research cs.HC · 2023-08-24 · accept · none · ref 16 · internal anchor
Project Aria presents a new wearable egocentric multi-modal recording device and software tools to accelerate AI research for augmented reality applications.
RoMa: Robust Dense Feature Matching cs.CV · 2023-05-24 · unverdicted · none · ref 37 · internal anchor
RoMa sets new state-of-the-art dense feature matching performance by fusing DINOv2 features with local ConvNet features, using anchor-probability transformer decoding, and regression-by-classification loss, with a 36% gain on WxBS.
Personalized Object Identification and Localization via In-Context Inference with Vision-Language Models cs.CV · 2026-07-01 · unverdicted · none · ref 23 · internal anchor
IPLoc-ID extends prior localization-only work to full identification and localization by using a self-posed query in VLMs to reject negative images while preserving comparable localization accuracy.
Does Your ViT Still Need U-Net for Segmentation? cs.CV · 2026-06-30 · unverdicted · none · ref 30 · internal anchor
EoSeg shows that modern ViT backbones support accurate medical image segmentation without U-Net-style decoders via multi-level query modeling and learnable block fusion, with strong results on seven benchmarks.
Decompose, Compare, and Decide: Multimodal LLMs are Implicit Few-Shot Learners cs.CV · 2026-06-30 · unverdicted · none · ref 30 · internal anchor
DeCoDe decomposes few-shot classification into binary pairwise image comparisons whose affirmative logits serve as similarity scores, enabling strong performance from unmodified MLLMs on twelve datasets.
Lost in the Tail: Addressing Geographic Imbalance in Urban Visual Place Recognition cs.CV · 2026-06-30 · unverdicted · none · ref 38 · internal anchor
DAPR is a model-agnostic plug-in that rebalances gradient contributions across head and tail classes and applies multi-scale distance search for distributional compactness, improving VPR performance by 18.3% on SF-XL v1 and 6.7% on v2.
Towards Voxel Spacing Consistency for Medical Image Segmentation cs.CV · 2026-06-30 · unverdicted · none · ref 65 · internal anchor
Consispace is a semantic-aware resampling method that uses an implicit neural network with ODE constraints and feature reweighting to achieve consistent axial voxel spacing while preserving anatomy and semantics, improving downstream segmentation.
PrISM-IQA: Image Quality Assessment Made Practical for Smartphone Photography cs.CV · 2026-06-30 · unverdicted · none · ref 34 · internal anchor
PrISM-IQA reformulates IQA as multi-issue ordinal diagnosis predicting absent/minor/severe/critical levels for 53 ISP issues using cumulative encoding and structured inference.
DPPE: Rethinking Camera-Based Positional Encoding for Scaling Multi-View Transformers cs.CV · 2026-06-30 · unverdicted · none · ref 48 · internal anchor
DPPE decouples rotation and translation in camera positional encodings for multi-view transformers to resolve late-stage training stagnation and improve generalization in novel view synthesis.
DualBrep: A Dual-Field Continuous Representation for B-rep Modelling cs.GR · 2026-06-30 · unverdicted · none · ref 7 · internal anchor
DualBrep encodes B-rep models as dual scalar fields (SDF geometry + UDF topology) compressed into a shared latent space for flow-matching generation and neural B-rep extraction.
Delta-JEPA: Learning Action-Sensitive World Models via Latent Difference Decoding cs.AI · 2026-06-30 · unverdicted · none · ref 10 · internal anchor
Delta-JEPA augments latent forward prediction with a Latent Difference Action Decoder that reconstructs actions from embedding displacements, yielding action-sensitive world models that improve planning on four visual continuous-control tasks over JEPA baselines.
WildProp: Visual Estimation of Wildlife Body Proportions at Scale cs.CV · 2026-06-30 · unverdicted · none · ref 50 · internal anchor
A retrieval-based framework using foundation models for pose-aware correspondence to estimate population-level wildlife body proportions from unconstrained images, with reported 10-20% median relative errors on bird and amphibian datasets.
GROW$^2$: Grounding Which and Where for Robot Tool Use cs.RO · 2026-06-29 · unverdicted · none · ref 62 · internal anchor
GROW² hierarchically grounds open-world tool affordances by using VLMs for semantic selection of objects and parts followed by geometric localization with vision foundation models.
Sequential Planning via Anchored Robotic Keypoints cs.RO · 2026-06-29 · unverdicted · none · ref 27 · internal anchor
SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.
Benchmark AUC Is Not Deployable Reliability: A Cross-Dataset Audit of Off-the-Shelf Features for Surveillance Video Anomaly Detection cs.CV · 2026-06-28 · unverdicted · none · ref 13 · internal anchor
Cross-dataset testing of nearest-neighbor and Mahalanobis anomaly detectors on CLIP, DINOv2, ResNet-50 and EfficientNet embeddings shows same-dataset AUC averaging 0.704 dropping to 0.499 on other datasets, with false-alarm rates around 31,931 per hour at usable operating points.
Rectifying Mask via Entropy for Distractor-Free 3DGS in Ambiguous Scenarios cs.CV · 2026-06-28 · unverdicted · none · ref 36 · internal anchor
RefineSplat applies entropy-aware adaptive masking and density control to 3DGS to remove color- or semantically ambiguous distractors, validated on a new 18-scene Ambiguous wild dataset with claimed SOTA results.
Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking cs.CV · 2026-06-28 · unverdicted · none · ref 31 · internal anchor
A language dependency parsing mechanism combined with Qwen-VL enables adaptive updates to textual descriptions for improved vision-language tracking performance on benchmarks like TNL2K and LaSOT.
Multi-scale Object-Aware Gaze Estimation via Geometric Reasoning cs.CV · 2026-06-28 · unverdicted · none · ref 41 · internal anchor
A two-stage object-aware gaze estimation method with multi-scale feature fusion and geometric constraints reports AUC scores of 0.961, 0.948, 0.987, and 0.977 on GazeFollow, VideoAttentionTarget, ChildPlay, and GOO-Real with a 7.1M parameter model.
MoPe: Motion Permanence for Robust Monocular Gaussian Mapping in Dynamic Environments cs.RO · 2026-06-28 · unverdicted · none · ref 13 · internal anchor
MoPe propagates historical dynamic posteriors via SE(3) warping and bounded Bayesian fusion to maintain persistent motion state in monocular Gaussian SLAM.
Flow Matching in Feature Space for Stochastic World Modeling cs.CV · 2026-06-27 · unverdicted · none · ref 61 · internal anchor
FlowWM applies flow matching directly in pretrained feature space with a one-step projection mechanism, improving perception accuracy, mode coverage, and horizon robustness on synthetic and real-world benchmarks.
Envisage: Diffusion-Based Rhinoplasty Goal Visualization with Mask-Decomposed Evaluation eess.IV · 2026-06-26 · unverdicted · none · ref 17 · internal anchor
Envisage applies FLUX.1 inpainting to rhinoplasty goal visualization and shows via SurgicalScore that mask-decomposed metrics outperform full-face identity scores for hard-composited localized edits.
Learning Topology-Aware Representations via Test-Time Adaptation for Anomaly Segmentation cs.CV · 2026-06-26 · unverdicted · none · ref 96 · internal anchor
TopoTTA integrates persistent homology into test-time adaptation to derive topological pseudo-labels from anomaly maps, improving segmentation by an average 15% F1 on six benchmarks while generalizing across 2D and 3D data.
VLM-Aware Meta-Optic Front-End Design for Frozen Vision-Language Models cs.CV · 2026-06-26 · unverdicted · none · ref 29 · internal anchor
CODA optimizes continuous-density meta-optics via adjoint gradients on Maxwell simulations to boost frozen CLIP zero-shot accuracy on ImageNet-100 from 53.75% to 65.41%, with transfer to other models.
Beyond Points: Spherical Distributional Part Prototypes for Interpretable Classification cs.CV · 2026-06-25 · unverdicted · none · ref 14 · 2 links · internal anchor
vMFProto models each class as a mixture of von Mises-Fisher components on the hypersphere, learns per-prototype concentrations, and applies entropic OT for assignments, yielding SOTA explanation quality on CUB, Dogs, and Cars with frozen DINO backbones.
ViQ: Text-Aligned Visual Quantized Representations at Any Resolution cs.CV · 2026-06-25 · unverdicted · none · ref 11 · internal anchor
ViQ is a new two-stage text-aligned quantization method for visual features supporting arbitrary resolutions that claims competitive multimodal performance with efficiency gains of 20-70%.
SatSplatDiff: Geometry-preserving generative refinement for high-fidelity satellite Gaussian Splatting cs.CV · 2026-06-25 · unverdicted · none · ref 88 · internal anchor
SatSplatDiff combines depth supervision and shadow-guided generative refinement with 2DGS to reduce geometric MAE by up to 18% and improve visual fidelity by 28-45% on satellite datasets while enabling 5x resolution enhancement.
Forget, Anticipate and Adapt: Test Time Training for Long Videos cs.CV · 2026-06-25 · unverdicted · none · ref 1 · internal anchor
FFN performs TTT on multi-hour videos by restricting updates to three frames and using a surprise metric for adaptive window sizing, plus a new EpicTours dataset.
MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation cs.CV · 2026-06-24 · unverdicted · none · ref 30 · internal anchor
MIMFlow uses a VAE on masked images to feed semantic latents to a normalizing flow while a decoder handles high-frequency details, reporting FID 2.50 and 71.3% linear probing on ImageNet 256x256 with 128 tokens.
Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity cs.CV · 2026-06-24 · unverdicted · none · ref 56 · internal anchor
Presents Invoice Haystack benchmark for homogeneous document retrieval and VL-RAG hybrid framework achieving 60% Recall@1 and up to 13.5 point gains over prior methods.
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers cs.CV · 2026-06-11 · unverdicted · none · ref 100 · internal anchor
HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.
ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection cs.CR · 2026-06-11 · unverdicted · none · ref 19 · internal anchor
ViPER uses a LoRA-adapted ViT-B/14 with dual heads for malware classification and packing detection plus a gating mechanism and weighted losses to reach 0.8521 balanced accuracy on 200k Windows PE images while detecting packing at 0.9949 AUC.
Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning cs.CV · 2026-06-10 · unverdicted · none · ref 32 · internal anchor
DSSA decouples per-frame appearance from temporal identity in slot attention mechanisms to reduce slot swapping and improve temporal consistency in video object segmentation.
Action-Effect Memory Pretraining for Robot Manipulation cs.RO · 2026-06-10 · unverdicted · none · ref 42 · internal anchor
AEM pretrains compact history representations via masked modeling on interleaved vision-action sequences to boost downstream robot manipulation in simulation and real settings.
Tac-DINO: Learning Vision-Tactile Features with Patch Alignment cs.CV · 2026-06-10 · unverdicted · none · ref 158 · internal anchor
Tac-DINO constructs a large tactile dataset and Vis-Tac Holographic Matching Benchmark, then proposes Vision-Tactile Patch Alignment (VTPA) methods that outperform non-aligned baselines on local-to-global feature matching.
LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation cs.CV · 2026-06-10 · unverdicted · none · ref 29 · internal anchor
LASA aggregates multi-layer attention from vision transformers to enable weakly supervised open-vocabulary semantic segmentation on scene sketches, reporting mIoU gains of +3.43 to +15.74 on three benchmarks over prior baselines.
Cross-Modal Benchmarking for Robotic Perception in Natural Environments cs.CV · 2026-06-10 · unverdicted · none · ref 18 · internal anchor
Presents the WildCross benchmark with 476K frames for place recognition and metric depth estimation in natural environments, demonstrating limitations of existing vision models.
TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation cs.RO · 2026-06-09 · unverdicted · none · ref 36 · internal anchor
TacForeSight trains a force-conditioned tactile world model to predict latent dynamics and uses those predictions as anticipatory priors inside a visuo-tactile policy for real-time contact-rich manipulation.
Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving cs.CV · 2026-06-09 · unverdicted · none · ref 55 · internal anchor
Envision4D presents a feed-forward 4D Gaussian Splatting framework with future pose prediction, temporal attention, and conditioned motion lifting for pose-free extrapolation in autonomous driving scenes.
Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning cs.LG · 2026-06-09 · unverdicted · none · ref 16 · internal anchor
BFQ enables single-step noise-to-action mapping in offline RL by dividing flow-path displacements into bootstrappable short-range components learned from marginal velocity.
