InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
Masked autoencoders are scalable vision learners
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 4years
2026 4verdicts
UNVERDICTED 4representative citing papers
UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.
VoxSAMNet introduces sparsity-aware deformable attention via a dummy node and foreground modulation with dropout plus text-guided filtering to reach new state-of-the-art mIoU of 18.2% on SemanticKITTI and 20.2% on SSCBench-KITTI-360 for monocular 3D scene completion.
citing papers explorer
-
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
-
UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
-
GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes
GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.
-
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
VoxSAMNet introduces sparsity-aware deformable attention via a dummy node and foreground modulation with dropout plus text-guided filtering to reach new state-of-the-art mIoU of 18.2% on SemanticKITTI and 20.2% on SSCBench-KITTI-360 for monocular 3D scene completion.