Scene Parsing through ADE20K Dataset
7 Pith papers cite this work.
fields: cs.CV (7)
citing papers explorer
- Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
  The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.
- UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
  UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
- Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
  DiTTA distills SAM2 temporal segmentation knowledge into image models via efficient test-time adaptation and a lightweight fusion module to produce annotation-free video semantic segmentation that matches or exceeds fully supervised performance.
- SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
  SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
- SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models
  SigLino distills SigLIP2 and DINOv3 into efficient vision models via asymmetric relation-knowledge distillation, token-balanced batching, and hierarchical data sampling on a new 200M-image corpus, yielding better transfer to grounding VLMs than training from scratch.
- RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
  RADSeg adapts the RADIO model with targeted enhancements to deliver 6-30% higher mIoU in zero-shot OVSS while using 2.5x fewer parameters and running 3.95x faster than prior large-model combinations.
- NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge Report
  The NTIRE 2026 RipDetSeg Challenge evaluated AI methods for rip current detection and segmentation, finding that pretrained general-purpose models with augmentation and post-processing performed well on a diverse multi-country dataset.