PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention
read the original abstract
We propose PostCam, a streamlined framework for novel-view video generation that achieves superior detail preservation and precise camera trajectory editing in dynamic scenes. Current methods often struggle with a trade-off between pose-based control, which lacks visual detail, and rendering-based guidance, which is overly sensitive to geometric accuracy. Despite recent hybrid attempts, achieving precise motion and visual consistency remains challenging due to the lack of effective cross-modal alignment. We argue that robust control stems from the deep alignment of multimodal signals rather than increased input complexity. Our core contribution is the Query-Shared Cross-Attention mechanism, which projects 6-DoF poses and rendered features into a unified latent space. This allows the model to spontaneously achieve intrinsic consistency between motion cues and pixel-level guidance during denoising. Experiments demonstrate that PostCam maintains high-fidelity visual details while outperforming state-of-the-art methods by 20% in trajectory precision, exhibiting superior robustness in complex dynamic scenes. Our project webpage is publicly available at: https://cccqaq.github.io/PostCam.github.io/
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.