RayPE extends video DiT attention with Plucker coordinates and a gated reciprocal-product term to improve 3D consistency and camera controllability.
Unified camera positional encoding for controlled video generation
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 9years
2026 9verdicts
UNVERDICTED 9roles
background 1polarities
background 1representative citing papers
CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
DPPE decouples rotation and translation in camera positional encodings for multi-view transformers to resolve late-stage training stagnation and improve generalization in novel view synthesis.
Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from flexible agent counts.
StreetNVS presents a multi-sensor conditioned video diffusion framework for street-view novel view synthesis that outperforms baselines with sparse LiDAR and handles extreme out-of-trajectory paths on the Waymo dataset.
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
A multi-view video diffusion model conditioned on relative camera poses via extended RoPE generates dense synchronized views from sparse inputs for 4D Gaussian splatting reconstruction, claiming SOTA results on human datasets and generalization to animals.
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.
citing papers explorer
-
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.