CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 2
citation-polarity summary: background 2
citing papers explorer
- SHED: Style-Homogenized Embedding Alignment for Domain Generalization
  SHED improves domain generalization in CLIP by aligning style-homogenized embeddings instead of raw ones, achieving state-of-the-art results on five benchmarks including a 4% gain on DomainNet.
- OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
  OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
- SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval
  SubSearch improves LLM reasoning traces on QA and multi-hop QA tasks by rewarding intermediate steps with intrinsic process rewards instead of only final outcomes.
- WOW-Seg: A Word-free Open World Segmentation Model
  WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.
- Pi-HOC: Pairwise 3D Human-Object Contact Estimation
  Pi-HOC predicts dense 3D semantic contacts for all human-object pairs in an image via instance-aware tokens and an InteractionFormer, achieving higher accuracy and 20x throughput than prior methods.
- Vision Transformers Need More Than Registers
  ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.
- Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
  Grad-ECLIP produces gradient-based visual and textual explanation heatmaps for CLIP by applying channel and spatial weights to token features instead of relying on sparse self-attention maps.
- Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
  The approach uses the analytic solution of distribution discrepancy consistency within categories as semantic maps, eliminating training and model-specific modulation while claiming state-of-the-art results on eight benchmarks.
- Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges
  A survey that organizes methods for cross-domain object detection into a taxonomy, analyzes domain shift across detection stages, and outlines persistent challenges.