Seg4diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers

Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim · 2025 · arXiv 2509.18096

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

extension 1

citation-polarity summary

extend 1

representative citing papers

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.

TORA: Topological Representation Alignment for 3D Shape Assembly

cs.CV · 2026-04-05 · unverdicted · novelty 7.0

TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmarks with zero inference overhead.

Diffusion Model as a Generalist Segmentation Learner

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.

Spectral Forensics of Diffusion Attention Graphs for Copy-Move Forgery Detection

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

Copy-move forgeries induce detectable spectral shifts in the normalized Laplacian of self-attention graphs from Stable Diffusion U-Nets, identifiable via Wasserstein distance to authentic reference spectra.

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.

C3G: Learning Compact 3D Representations with 2K Gaussians

cs.CV · 2025-12-03 · unverdicted · novelty 6.0

C3G creates compact 3D Gaussian representations with 2K points by guiding placement via learnable tokens that aggregate multi-view features through attention, yielding better efficiency and performance than dense methods.

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

cs.CV · 2026-05-14

citing papers explorer

Showing 7 of 7 citing papers.

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos cs.CV · 2026-05-21 · unverdicted · none · ref 32
NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
TORA: Topological Representation Alignment for 3D Shape Assembly cs.CV · 2026-04-05 · unverdicted · none · ref 25
TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmarks with zero inference overhead.
Diffusion Model as a Generalist Segmentation Learner cs.CV · 2026-04-27 · unverdicted · none · ref 46
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
Spectral Forensics of Diffusion Attention Graphs for Copy-Move Forgery Detection cs.CV · 2026-04-19 · unverdicted · none · ref 18
Copy-move forgeries induce detectable spectral shifts in the normalized Laplacian of self-attention graphs from Stable Diffusion U-Nets, identifiable via Wasserstein distance to authentic reference spectra.
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 16
Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
C3G: Learning Compact 3D Representations with 2K Gaussians cs.CV · 2025-12-03 · unverdicted · none · ref 31
C3G creates compact 3D Gaussian representations with 2K points by guiding placement via learnable tokens that aggregate multi-view features through attention, yielding better efficiency and performance than dense methods.
Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers cs.CV · 2026-05-14 · unreviewed · ref 4

Seg4diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer