CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

2023 · arXiv 2310.01403

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

SHED: Style-Homogenized Embedding Alignment for Domain Generalization

cs.CV · 2026-05-16 · conditional · novelty 7.0

SHED improves domain generalization in CLIP by aligning style-homogenized embeddings instead of raw ones, achieving state-of-the-art results on five benchmarks including a 4% gain on DomainNet.

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.

SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval

cs.IR · 2026-04-08 · unverdicted · novelty 7.0

SubSearch improves LLM reasoning traces on QA and multi-hop QA tasks by rewarding intermediate steps with intrinsic process rewards instead of only final outcomes.

WOW-Seg: A Word-free Open World Segmentation Model

cs.CV · 2026-05-16 · conditional · novelty 6.0

WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.

Pi-HOC: Pairwise 3D Human-Object Contact Estimation

cs.CV · 2026-04-14 · unverdicted · novelty 6.0 · 2 refs

Pi-HOC predicts dense 3D semantic contacts for all human-object pairs in an image via instance-aware tokens and an InteractionFormer, achieving higher accuracy and 20x throughput than prior methods.

Vision Transformers Need More Than Registers

cs.CV · 2026-02-25 · unverdicted · novelty 6.0

ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.

Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP

cs.CV · 2025-02-26 · conditional · novelty 6.0

Grad-ECLIP produces gradient-based visual and textual explanation heatmaps for CLIP by applying channel and spatial weights to token features instead of relying on sparse self-attention maps.

Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

The approach uses the analytic solution of distribution discrepancy consistency within categories as semantic maps, eliminating training and model-specific modulation while claiming state-of-the-art results on eight benchmarks.

Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges

cs.CV · 2026-04-09 · unverdicted · novelty 3.0

A survey that organizes methods for cross-domain object detection into a taxonomy, analyzes domain shift across detection stages, and outlines persistent challenges.

citing papers explorer

SHED: Style-Homogenized Embedding Alignment for Domain Generalization · cs.CV · 2026-05-16 · conditional · none · ref 17
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance · cs.CV · 2026-04-09 · unverdicted · none · ref 48
SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval · cs.IR · 2026-04-08 · unverdicted · none · ref 3
WOW-Seg: A Word-free Open World Segmentation Model · cs.CV · 2026-05-16 · conditional · none · ref 21
Pi-HOC: Pairwise 3D Human-Object Contact Estimation · cs.CV · 2026-04-14 · unverdicted · none · ref 29 · 2 links
Vision Transformers Need More Than Registers · cs.CV · 2026-02-25 · unverdicted · none · ref 37
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP · cs.CV · 2025-02-26 · conditional · none · ref 28
Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation · cs.CV · 2026-04-09 · unverdicted · none · ref 46
Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges · cs.CV · 2026-04-09 · unverdicted · none · ref 109
