Emerging properties in self-supervised vision transformers
8 Pith papers cite this work.
citing papers explorer
- VideoCoF: Unified Video Editing with Temporal Reasoner
  VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.
- FastVGGT: Training-Free Acceleration of Visual Geometry Transformer
  FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.
- Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation
  SeeCo is a training-free on-the-fly recalibration method using multi-view geometric consistency and adaptive textual calibration to improve open-vocabulary semantic segmentation in remote sensing images.
- VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
  VGGT-Segmentor achieves new SOTA cross-view segmentation on Ego-Exo4D (67.7% Ego-to-Exo, 68.0% Exo-to-Ego IoU) via geometry-enhanced features, a three-stage segmentation head, and correspondence-free pretraining.
- Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
  TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.
- PEPR: Privileged Event-based Predictive Regularization for Domain Generalization
  PEPR reframes learning with privileged event data as predicting latent event features from RGB to improve domain generalization in object detection and segmentation without direct cross-modal alignment.
- RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
  RADSeg adapts the RADIO model with targeted enhancements to deliver 6-30% higher mIoU in zero-shot OVSS while using 2.5x fewer parameters and running 3.95x faster than prior large-model combinations.
- CLIP-Guided Data Augmentation for Night-Time Image Dehazing
  CLIP-guided selection of external data plus staged NAFNet training and inference fusion provides an effective pipeline for nighttime image dehazing in the NTIRE 2026 challenge.