Video diffusion models
2 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
  DiTs use either a two-stage cross-attention circuit or a text-token fusion circuit for spatial relations, depending on the text encoder, achieving near-perfect in-domain accuracy but differing in out-of-domain robustness.
- RoboTAG: End-to-end Robot Configuration Estimation via Topological Alignment Graph
  RoboTAG estimates robot poses from monocular images via a topological alignment graph with 2D-3D co-evolution and consistency supervision to reduce reliance on labeled data.