arXiv preprint arXiv:2412.07721 (2024)
3 Pith papers cite this work.
fields: cs.CV (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
representative citing papers
P3Sim integrates a probabilistic physical world model with geometric conditioning and persistent memory to simulate 3D scenes under partial observations and incomplete transforms.
A probabilistic graphical model called 3WM unifies 3D vision tasks into one system that performs them zero-shot by selecting different inference pathways through multimodal scene nodes.
citing papers explorer
- QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers
  QWERTY enables training-free motion control in pretrained image-to-video DiTs by warping the frame-invariant semantic subspace of queries in 3D full attention and using the predicted noise as self-guidance for latent optimization.
- Perceptual 3D Simulation With Physical World Modeling
  P3Sim integrates a probabilistic physical world model with geometric conditioning and persistent memory to simulate 3D scenes under partial observations and incomplete transforms.
- Unified 3D Scene Understanding Through Physical World Modeling
  A probabilistic graphical model called 3WM unifies 3D vision tasks into one system that performs them zero-shot by selecting different inference pathways through multimodal scene nodes.