arxiv: 2604.09429 · v3 · submitted 2026-04-10 · 💻 cs.CV · cs.AI · cs.LG


Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang


Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords video diffusion model · camera pose estimation · novel view synthesis · joint distribution · ray pixels · raxels · decoupled attention · closed-loop consistency


The pith

A video diffusion model learns a joint distribution over videos and camera trajectories by encoding rays as pixels in the shared latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that representing each camera as dense ray pixels in the exact same latent space as video frames lets one model denoise both together and handle pose prediction, trajectory-conditioned video generation, and joint synthesis without separate 3D supervision. This unification matters because prior methods treated recovery of camera parameters and novel-view rendering as independent tasks that fail when image coverage is sparse or poses are ambiguous. A sympathetic reader would see the value in a single framework that produces consistent closed-loop results where predicted poses and the videos generated from them agree.

Core claim

The central claim is that rays as pixels, a pixel-aligned encoding of cameras that occupies the identical latent space as video frames, combined with Decoupled Self-Cross Attention for joint denoising, allows a single Video Diffusion Model to learn the joint distribution over videos and camera trajectories, supporting accurate camera-pose prediction from video, video generation along a prescribed trajectory, and joint synthesis of both from input images.

What carries the argument

Dense ray pixels (raxels) – pixel-aligned camera encodings placed in the same latent space as video frames – denoised jointly via Decoupled Self-Cross Attention.
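To make "pixel-aligned camera encoding" concrete, here is a minimal sketch of one way to turn a pinhole camera into an image-shaped array of per-pixel ray directions. It is not the paper's raxel definition, which this page does not spell out. The function name `ray_image`, the convention `x_cam = R @ x_world + t`, and the half-pixel offset are all assumptions made for illustration.

```python
import numpy as np

def ray_image(K, R, t, height, width):
    """Per-pixel world-space ray directions for a pinhole camera.

    Assumed convention (not taken from the paper): x_cam = R @ x_world + t,
    K is the 3x3 intrinsic matrix, pixel centers sit at (u + 0.5, v + 0.5).
    Returns (origin, directions); directions has shape (height, width, 3).
    """
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixel coords
    dirs_cam = pixels @ np.linalg.inv(K).T                # back-project into the camera frame
    dirs_world = dirs_cam @ R                             # row-vector form of R^T @ d
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origin = -R.T @ t                                     # camera center in world coordinates
    return origin, dirs_world

# Hypothetical 5x5 camera looking down +z from the world origin.
K = np.array([[2.0, 0.0, 2.5], [0.0, 2.0, 2.5], [0.0, 0.0, 1.0]])
origin, dirs = ray_image(K, np.eye(3), np.zeros(3), height=5, width=5)
```

The result has the same spatial layout as a video frame, with one 3-vector per pixel. That layout is the property that would let a camera be passed through the same encoder as RGB frames. How the paper maps rays to pixel values before encoding is not shown here.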

If this is right

The same trained model can predict camera trajectories from video input.
The model can generate video from input images along any pre-defined camera trajectory.
The model can jointly synthesize both video and trajectory from input images.
Predicted poses and the renderings conditioned on those poses remain consistent in self-consistency tests.
Representing cameras in the shared video latent space outperforms Plücker embeddings on the evaluated tasks.
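For context on the baseline named in the last point, Plücker ray embeddings are commonly built per pixel from the ray direction d and the moment o × d, where o is the camera center. The sketch below assumes that common form. The function name `plucker_embedding` and the (direction, moment) channel order are illustrative choices, and the paper's exact baseline configuration is not given on this page.

```python
import numpy as np

def plucker_embedding(origin, directions):
    """6-channel Plücker ray map, (d, o x d) per pixel.

    origin: (3,) camera center. directions: (H, W, 3) unit ray directions.
    The channel order is an assumption; some implementations use (o x d, d).
    """
    origins = np.broadcast_to(origin, directions.shape)
    moment = np.cross(origins, directions)                # o x d for every pixel
    return np.concatenate([directions, moment], axis=-1)  # (H, W, 6)

# Hypothetical camera at (1, 0, 0) with every ray pointing along +z.
dirs = np.tile(np.array([0.0, 0.0, 1.0]), (2, 3, 1))
emb = plucker_embedding(np.array([1.0, 0.0, 0.0]), dirs)
```

Such a map has six channels and is typically injected as a separate conditioning signal, whereas the raxel design places the camera in the video latent space itself. That difference is what the ablation is said to test.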

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could reduce error accumulation in applications like robotics or augmented reality where pose estimation and view synthesis must stay aligned.
Joint modeling in a shared latent space might extend naturally to other scene attributes such as depth or lighting if they can be encoded similarly.
Longer sequences or more complex dynamic scenes would test whether the raxel representation scales without additional regularization.
The method suggests that many separate 3D-vision modules could be replaced by a single diffusion process if their signals can be cast into a common pixel-like latent format.

Load-bearing premise

Encoding cameras as dense ray pixels in the identical latent space as video frames together with Decoupled Self-Cross Attention is enough to learn a coherent joint distribution that supports accurate pose prediction, controlled generation, and self-consistent closed-loop behavior without extra 3D supervision.

What would settle it

A closed-loop self-consistency test in which the model first predicts a camera trajectory from a video clip and then generates a new video conditioned on that trajectory, where the generated video fails to match the original clip in appearance or motion.
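One way such a test could be scored is sketched below, under stated assumptions. The callables `predict_poses` and `generate_video` are hypothetical stand-ins for the trained model, poses are reduced to 3x3 rotation matrices, and PSNR and geodesic rotation error are illustrative metrics, not the ones the paper reports.

```python
import numpy as np

def rotation_error_deg(R_a, R_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio between two arrays with values in [0, max_value]."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    mse = float(np.mean(diff ** 2))
    return float("inf") if mse == 0.0 else float(10.0 * np.log10(max_value ** 2 / mse))

def closed_loop_check(video, predict_poses, generate_video):
    """Predict poses, regenerate the clip from them, then compare both ways.

    predict_poses(video) -> sequence of 3x3 rotations (hypothetical interface).
    generate_video(first_frame, poses) -> array shaped like video (hypothetical).
    """
    poses = predict_poses(video)
    regenerated = generate_video(video[0], poses)
    repredicted = predict_poses(regenerated)
    rot_errors = [rotation_error_deg(a, b) for a, b in zip(poses, repredicted)]
    return {"psnr": psnr(video, regenerated),
            "max_rotation_error_deg": max(rot_errors)}
```

A low PSNR or a large rotation error on held-out clips would be the kind of failure described above. Any pass or fail threshold would have to come from the paper's own protocol.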

Figures

Figures reproduced from arXiv: 2604.09429 by Juan Camilo Perez, Juan-Manuel Perez-Rua, Kam Woh Ng, Sanskar Agrawal, Shikun Liu, Soubhik Sanyal, Tao Xiang, Wonbong Jang, Yiannis Douratsos.

**Figure 1.** Rays as Pixels: Unifying Video Generation and Camera Pose Estimation. The first two rows show trajectory-controlled video generation from user-defined camera paths, and the last row shows pose estimation recovering a camera trajectory from a raw input video. By representing camera parameters as dense raxels (rays as pixels), our model learns a joint distribution of videos and camera trajectories, enabling …

**Figure 2.** Overview of the Rays as Pixels Framework. Training (left): We jointly encode video frames and their corresponding raxel images into a shared latent space using a frozen spatio-temporal VAE encoder. Video inputs undergo 4× temporal compression, while the temporal dimensions of image inputs remain as they are. The DiT jointly denoises these latents via Decoupled Self-Cross Attention, conditioning on clean so…

**Figure 3.** Qualitative Results on DL3DV. We visualize the NVS performance of our model against the state-of-the-art baseline Kaleido (Liu et al., 2026) on DL3DV. Given a single reference image (first column), our model synthesizes target views (third column) that exhibit superior structural fidelity and lighting consistency, closely matching the ground truth (fourth column). Note that ours predicts the camera paramet…

**Figure 4.** Decoupled Self-Cross Attention. We replace standard self-attention in the video diffusion backbone with a Decoupled Self-Cross Attention block that processes video and ray latents in parallel branches. Within each branch, intra-modal self-attention operates on tokens of the same modality, followed by inter-modal cross-attention where queries from one modality attend to keys and values from the other. This …

**Figure 5.** Qualitative Results on DL3DV-140 following the predefined camera trajectory. We visualize pose-conditioned video generation given a single input image from DL3DV-140 (Ling et al., 2024) and a specific camera path. These scenes present distinct challenges: the top example features a cluttered layout with an off-center subject, while the bottom example contains highly reflective metallic surfaces. Top: The m…

**Figure 6.** Self-Consistency and Ablation Study. Re-generated frames from the cycle self-consistency test on three scenes from DL3DV. From left to right: with Plücker embeddings replacing raxels, without DSCA, without cosine similarity loss, our full model, and ground truth. Replacing raxels with Plücker embeddings causes the most severe degradation, consistent with the quantitative results in Table 2. Ablation St…
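The block described in the Figure 4 caption (intra-modal self-attention in each branch, then inter-modal cross-attention) can be sketched in a few lines of NumPy. This is an illustrative reading of the caption, not the paper's architecture. Single-head attention, the residual connections, the absence of normalization layers, and the choice to cross-attend to the other branch's post-self-attention tokens are all assumptions, and every name here is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention; queries attend to context."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def decoupled_self_cross_attention(video, rays, params):
    """video: (Nv, D) tokens, rays: (Nr, D) tokens, params: dict of (w_q, w_k, w_v) tuples."""
    # Intra-modal self-attention, one branch per modality (residuals are an assumption).
    video_s = video + attend(video, video, *params["video_self"])
    rays_s = rays + attend(rays, rays, *params["rays_self"])
    # Inter-modal cross-attention: queries from one modality, keys and values from the other.
    video_out = video_s + attend(video_s, rays_s, *params["video_cross"])
    rays_out = rays_s + attend(rays_s, video_s, *params["rays_cross"])
    return video_out, rays_out

# Hypothetical sizes: 6 video tokens, 4 ray tokens, width 8, random weights.
rng = np.random.default_rng(0)
names = ["video_self", "rays_self", "video_cross", "rays_cross"]
params = {n: tuple(rng.normal(size=(8, 8)) * 0.1 for _ in range(3)) for n in names}
video_out, rays_out = decoupled_self_cross_attention(
    rng.normal(size=(6, 8)), rng.normal(size=(4, 8)), params)
```

Each modality keeps its own token sequence and its own projection weights, and the two streams exchange information only through the cross-attention step. That is the sense in which the attention is decoupled in this sketch.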

Original abstract

Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and do camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre-defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Plücker embeddings confirm that representing cameras in a shared latent space with video is subtantially more effective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Raxels put camera trajectories into the video latent space for joint denoising, which cleanly unifies pose prediction and controlled generation with a useful self-consistency check, but the abstract gives no numbers so the practical gains stay unproven.


The main new piece is encoding each camera as dense ray pixels (raxels) that sit in the exact same latent space as the video frames, then denoising both with a decoupled self-cross attention block. This single model is trained to do three things at once: recover trajectories from video, generate video along a given trajectory, and synthesize both from partial input. The closed-loop test, where the model’s own pose predictions are fed back to generate video and the two are checked for agreement, is a practical way to validate consistency without extra 3D labels. The ablation that beats Plücker embeddings also shows why the shared-space choice matters over pure geometric encodings. Those elements are coherent and address a real split in the literature between pose pipelines and view-synthesis pipelines. The approach looks like it could be useful for robotics or AR where you want one generative model to keep camera and content consistent.

The soft spot is the lack of any quantitative evidence in what we have: no error rates on pose, no video quality scores, no dataset sizes, and no head-to-head numbers against separate pose estimators or video generators. Without those, it is impossible to tell whether the joint distribution actually improves results or just adds complexity. The “first single framework” claim also needs a careful literature pass once the full paper is read.

This is aimed at people already working on video diffusion models who want camera control built in. A reader who cares about consistent generative 3D video would get value from the raxel trick and the attention design even if the final numbers are only average. It deserves a serious referee because the architecture is thoughtful, the consistency test is a solid idea, and the unification problem is worth community feedback. I would send it out for review rather than desk reject.

Referee Report

0 major / 3 minor

Summary. The paper introduces Rays as Pixels, a video diffusion model that learns a joint distribution over videos and camera trajectories. Cameras are encoded as dense ray pixels (raxels) sharing the same latent space as video frames and denoised jointly via Decoupled Self-Cross Attention. A single model performs three tasks: camera trajectory prediction from video, camera-controlled video generation from input images, and joint video-trajectory synthesis. Evaluation covers pose estimation, controlled generation, and a closed-loop self-consistency test, with ablations favoring the shared-space raxel design over Plücker embeddings.

Significance. If the quantitative results and closed-loop test hold, the work is significant for unifying pose estimation and novel-view video synthesis in one generative framework without explicit 3D supervision. The pixel-aligned raxel representation and joint denoising mechanism are technically coherent innovations that directly enable the claimed multi-task capability. Credit is due for the self-consistency evaluation protocol and the Plücker ablation, both of which provide falsifiable checks on the joint-distribution claim.

minor comments (3)

Abstract: 'subtantially' is a typographical error and should read 'substantially'.
Abstract and method description: the precise definition of raxels (how rays are sampled and encoded into the latent space) and the exact architecture of Decoupled Self-Cross Attention would benefit from an early figure or equation block to improve readability for readers unfamiliar with the construction.
Evaluation section: while the closed-loop consistency test is a strength, the manuscript should explicitly state the quantitative metrics used to measure agreement between predicted poses and rendered frames (e.g., pose error thresholds or perceptual consistency scores) so that the test's robustness can be assessed.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our work, for highlighting the significance of the joint-distribution approach and the self-consistency evaluation, and for recommending minor revision. No specific major comments were raised that require technical changes.

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces a video diffusion model architecture that encodes camera trajectories as dense raxels sharing the video latent space, then performs joint denoising via Decoupled Self-Cross Attention. This directly enables the three tasks (pose prediction, camera-controlled generation, joint synthesis) and the closed-loop consistency evaluation. No equations, derivations, or self-citations are shown that reduce any claimed prediction or joint distribution to a fitted parameter or input defined by the result itself. Training occurs on external video data with independent ablations (e.g., vs. Plücker embeddings) and external evaluation metrics. The construction is self-contained against benchmarks and does not rely on self-referential definitions or load-bearing prior work by the authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven effectiveness of the raxel encoding and the decoupled attention mechanism for learning a joint video-camera distribution; these are introduced without prior independent validation or formal proof.

axioms (1)

domain assumption: Video diffusion models can be extended to model joint distributions over appearance and camera geometry when both are represented in a shared latent space.
Invoked by the proposal to denoise video frames and raxels together.

invented entities (1)

Raxels (dense ray pixels): no independent evidence
purpose: Pixel-aligned encoding of camera trajectories placed in the same latent space as video frames for joint denoising.
New representation introduced by the paper; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5549 in / 1516 out tokens · 73722 ms · 2026-05-10T18:16:00.463430+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
cs.CV · 2026-05 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

LTX-Video: Realtime Video Latent Diffusion

URL https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf. HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., Panet, P., Weissbuch, S., Kulikov, V., Bitterman, Y., Melumian, Z., and Bibi, O. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:25...

[2]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., and Yang, C. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101,

[3]

Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models. arXiv preprint arXiv:2503.10592, 2025

He, H., Yang, C., Lin, S., Xu, Y., Wei, M., Gui, L., Zhao, Q., Wetzstein, G., Jiang, L., and Li, H. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models. arXiv preprint arXiv:2503.10592,

[4]

Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,

doi: 10.1109/TPAMI.2024.3444912. Hu, W., Gao, X., Li, X., Zhao, S., Cun, X., Zhang, Y., Quan, L., and Shan, Y. Depthcrafter: Generating consistent long depth sequences for open-world videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2005–2015,

[6]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

URL https://arxiv.org/abs/2509.13414. Kerbl, B., Kopanas, G., Leimkühler, T., and Drettakis, G. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 2023a. Kerbl, B., Kopanas, G., Leimkühler, T., and Drettakis, G. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 2023...

[7]

Collaborative video diffusion: Consistent multi-video generation with camera control

Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L., and Wetzstein, G. Collaborative video diffusion: Consistent multi-video generation with camera control. arXiv preprint arXiv:2405.17414,

[8]

Cameras as relative positional encoding

Li, R., Yi, B., Liu, J., Gao, H., Ma, Y., and Kanazawa, A. Cameras as relative positional encoding. arXiv preprint arXiv:2507.10496,

[9]

arXiv preprint arXiv:2412.12091 (2024)

Liang, H., Cao, J., Goel, V., Qian, G., Korolev, S., Terzopoulos, D., Plataniotis, K., Tulyakov, S., and Ren, J. Wonderland: Navigating 3d scenes from a single image. arXiv preprint arXiv:2412.12091,

[11]

Depth Anything 3: Recovering the Visual Space from Any Views

URL https://arxiv.org/abs/2511.10647. Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

[12]

Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023

Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., and Wang, W. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023b. Long, X., Guo, Y.-C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.-H., Habermann, M., Theobalt, C., et al. Wonder3d: Single image to 3d using cross-domain...

[13]

DreamFusion: Text-to-3D using 2D Diffusion

Technical report. Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988,

[14]

Make-A-Video: Text-to-Video Generation without Text-Video Data

URL https://arxiv.org/abs/2209.14792. Skorokhodov, I., Tulyakov, S., and Elhoseiny, M. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3626–3636,

[15]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717,

[16]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W....

[17]

Motionctrl: A unified and flexible motion controller for video generation, 2024

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., and Revaud, J. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024a. Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., and Revaud, J. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE Conference on Computer Vis...

[18]

Video models are zero-shot learners and reasoners

Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., and Shan, Y. Motionctrl: A unified and flexible motion controller for video generation. In Proceedings of SIGGRAPH, 2024c. Wiedemer, T., Li, Y., Vicol, P., Gu, S. S., Matarese, N., Swersky, K., Kim, B., Jaini, P., and Geirhos, R. Video models are zero-shot learners and reasoners. arXiv...

[19]

Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509, 2024

Xu, D., Nie, W., Liu, C., Liu, S., Kautz, J., Wang, Z., and Vahdat, A. Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509,

[20]

Seeing without pixels: Perception from camera trajectories. arXiv preprint arXiv:2511.21681,

Xue, Z., Grauman, K., Damen, D., Zisserman, A., and Han, T. Seeing without pixels: Perception from camera trajectories. arXiv preprint arXiv:2511.21681,

[21]

arXiv preprint arXiv:2503.05638 (2025)

Yu, M., Hu, W., Xing, J., and Shan, Y. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. arXiv preprint arXiv:2503.05638, 2025a. Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.-T., Shan, Y., and Tian, Y. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. IE...