Scaling Sequence-to-Sequence Generative Neural Rendering
Pith reviewed 2026-05-18 10:05 UTC · model grok-4.3
The pith
A single sequence-to-sequence transformer pretrained on video data matches per-scene 3D optimization quality for view synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kaleido frames 3D as a sub-domain of video and performs photorealistic neural rendering through masked autoregressive sequence-to-sequence synthesis inside a decoder-only rectified flow transformer. The model generates any number of 6-DoF target views conditioned on any number of reference views, unifies 3D and video modeling in a single architecture, and uses video pretraining to achieve new state-of-the-art results on view synthesis benchmarks, with zero-shot performance that substantially exceeds other generative methods in few-view settings and matches per-scene optimization methods in many-view settings.
What carries the argument
A decoder-only rectified flow transformer that performs masked autoregressive sequence-to-sequence image synthesis, enabling unified generative view synthesis without explicit 3D representations.
Load-bearing premise
That the geometric structure and consistency of 3D scenes can be fully captured by treating them as ordinary image sequences in a video-style generative model without dedicated 3D components.
What would settle it
A controlled experiment on a standard view synthesis benchmark showing persistent geometric inconsistencies or lower quality than per-scene optimization methods even after video pretraining and with many input views available.
Original abstract
We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systematic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets -- all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Kaleido, a family of generative models for photorealistic unified object- and scene-level neural rendering. It treats 3D as a specialized sub-domain of video expressed purely as a sequence-to-sequence image synthesis task, implemented via a decoder-only rectified flow transformer with a masked autoregressive framework. The model performs generative view synthesis without explicit 3D representations, generates any number of 6-DoF target views from any number of reference views, and unifies 3D and video modeling. It leverages large-scale video pre-training to improve spatial consistency without architectural modifications and claims new state-of-the-art results on view synthesis benchmarks, with zero-shot performance outperforming other generative methods in few-view settings and, for the first time, matching per-scene optimization methods in many-view settings.
Significance. If the central claims hold, the work would be significant for demonstrating that a single unmodified decoder-only rectified flow transformer pretrained on video data can achieve high-quality neural rendering across 3D and video domains. This could reduce reliance on scarce camera-labeled 3D datasets, simplify architectures by avoiding explicit 3D representations, and advance unified generative modeling. The scaling study and reported parity with per-scene methods like NeRF/3DGS in many-view regimes would represent a notable contribution if supported by rigorous controls for geometric consistency.
major comments (2)
- [Abstract] Abstract: The claim that Kaleido 'for the first time, matches the quality of per-scene optimisation methods in many-view settings' is central to the contribution. This depends on the video-pretrained sequence model implicitly capturing accurate multi-view geometry and cross-view consistency without explicit 3D structure or architectural changes. Standard 2D metrics (PSNR/SSIM) can mask inconsistencies; the manuscript should provide explicit multi-view consistency analysis or 3D-aware metrics to substantiate parity with NeRF/3DGS.
- [Abstract] Abstract: The text simultaneously highlights 'key architectural innovations' that enable the unified framework and states that video pre-training occurs 'without any architectural modifications.' This tension is load-bearing for the seamless-unification premise and requires clarification on the precise nature of any innovations versus the unmodified decoder-only rectified flow transformer.
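The first major comment rests on the observation that per-image metrics can hide cross-view disagreement. A minimal sketch of that effect follows, using made-up intensities rather than anything from the paper: two hypothetical renderings score the same mean PSNR against ground truth, yet only one of them is consistent across views. The names `psnr`, `mean_psnr`, and `cross_view_gap` are illustrative helpers, not metrics defined in the manuscript.

```python
import math

def psnr(rendered, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - g) ** 2 for r, g in zip(rendered, reference)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Toy setup (all values hypothetical): the same two surface points are visible
# in view 1 and view 2, and their true intensities are identical in both views.
truth = [0.5, 0.5]

# Rendering A is off by +0.1 everywhere, but both views agree with each other.
consistent = {"view1": [0.6, 0.6], "view2": [0.6, 0.6]}
# Rendering B has the same error magnitude, but the two views disagree in sign.
inconsistent = {"view1": [0.6, 0.6], "view2": [0.4, 0.4]}

def mean_psnr(views):
    """Average per-view PSNR against the shared ground truth."""
    return sum(psnr(v, truth) for v in views.values()) / len(views)

def cross_view_gap(views):
    """Mean absolute disagreement between the two views at corresponding points."""
    pairs = zip(views["view1"], views["view2"])
    return sum(abs(a - b) for a, b in pairs) / len(truth)

if __name__ == "__main__":
    for name, views in (("consistent", consistent), ("inconsistent", inconsistent)):
        print(name, round(mean_psnr(views), 3), round(cross_view_gap(views), 3))
```

Both renderings have a per-pixel squared error of 0.01, so the image-level score cannot separate them; only a metric that compares views against each other does.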
minor comments (2)
- [Abstract] Abstract: Define 'few-view' and 'many-view' settings quantitatively (e.g., number of reference views) and list the specific benchmarks and baselines used to support the SOTA and parity claims.
- [Methods] The masked autoregressive generation procedure for arbitrary numbers of 6-DoF views would benefit from a dedicated methods subsection with pseudocode or a clear conditioning diagram.
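As a rough illustration of the kind of pseudocode the last minor comment asks for, here is one possible reading of chunked, masked autoregressive sampling. It is not the authors' procedure. Everything in it is an assumption made for the sketch: the scalar `pose` stands in for a 6-DoF camera, `toy_velocity` replaces the transformer with a closed-form velocity field, and `chunk_size`, `steps`, and `seed` are hypothetical parameters.

```python
import random
from typing import Callable, List, Sequence

Latent = List[float]
Velocity = Callable[[Latent, float, Sequence[Latent], float], Latent]

def toy_velocity(z: Latent, t: float, context: Sequence[Latent], pose: float) -> Latent:
    """Stand-in for the transformer (hypothetical): the exact rectified-flow
    velocity toward a target equal to the context mean shifted by the pose."""
    target = [sum(c[d] for c in context) / len(context) + pose for d in range(len(z))]
    return [(zi - ti) / t for zi, ti in zip(z, target)]

def euler_sample(velocity: Velocity, context: Sequence[Latent], pose: float,
                 dim: int, steps: int, rng: random.Random) -> Latent:
    """Integrate dz/dt = v from t=1 (pure noise) down to t=0 with Euler steps."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity(z, t, context, pose)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z

def generate_views(reference_latents: Sequence[Latent], target_poses: Sequence[float],
                   chunk_size: int, velocity: Velocity = toy_velocity,
                   steps: int = 8, seed: int = 0) -> List[Latent]:
    """Generate target views chunk by chunk; finished views join the context."""
    rng = random.Random(seed)
    dim = len(reference_latents[0])
    context = [list(r) for r in reference_latents]
    outputs: List[Latent] = []
    for start in range(0, len(target_poses), chunk_size):
        chunk = target_poses[start:start + chunk_size]
        # Views in the same chunk see the same context (references + earlier chunks).
        new_views = [euler_sample(velocity, context, p, dim, steps, rng) for p in chunk]
        outputs.extend(new_views)
        context.extend(new_views)  # later chunks condition on what was generated
    return outputs

if __name__ == "__main__":
    print(generate_views([[0.0, 0.0], [2.0, 2.0]], [1.0, 2.0, 3.0], chunk_size=2))
```

The point of the sketch is only the control flow: any number of references seeds the context, any number of targets is produced, and each completed chunk becomes conditioning for the next.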
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation of our claims without altering the core technical contributions.
-
Referee: [Abstract] Abstract: The claim that Kaleido 'for the first time, matches the quality of per-scene optimisation methods in many-view settings' is central to the contribution. This depends on the video-pretrained sequence model implicitly capturing accurate multi-view geometry and cross-view consistency without explicit 3D structure or architectural changes. Standard 2D metrics (PSNR/SSIM) can mask inconsistencies; the manuscript should provide explicit multi-view consistency analysis or 3D-aware metrics to substantiate parity with NeRF/3DGS.
Authors: We appreciate the referee's emphasis on rigorous validation of geometric consistency. While PSNR and SSIM remain the community-standard metrics for direct comparison against per-scene optimization baselines such as NeRF and 3DGS (as used in prior generative view synthesis works), we acknowledge that these image-level metrics alone can obscure certain multi-view inconsistencies. Our manuscript already includes extensive qualitative visualizations across diverse scenes demonstrating cross-view coherence, and the scaling experiments show that video pre-training measurably improves spatial consistency. To further substantiate the parity claim, we will add a dedicated subsection in the Experiments section reporting additional 3D-aware analyses, including average cross-view optical flow consistency and reprojection error statistics computed on held-out views where camera poses permit. These additions will be included in the revised manuscript.
revision: yes
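The reprojection error statistic mentioned in this response can be sketched as follows. This is a generic pinhole-camera formulation, not the analysis the authors commit to. It assumes world-to-camera extrinsics (x_cam = R x + t), a single focal length, and known 3D points with observed pixel locations; all function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]

def project(point: Vec3, R: Mat3, t: Vec3, focal: float,
            cx: float = 0.0, cy: float = 0.0) -> Tuple[float, float]:
    """Project a world-space point to pixel coordinates with a pinhole camera.
    Assumes world-to-camera extrinsics: x_cam = R @ point + t."""
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return (focal * cam[0] / cam[2] + cx, focal * cam[1] / cam[2] + cy)

def mean_reprojection_error(points: Sequence[Vec3],
                            observed_pixels: Sequence[Tuple[float, float]],
                            R: Mat3, t: Vec3, focal: float,
                            cx: float = 0.0, cy: float = 0.0) -> float:
    """Mean Euclidean pixel distance between projected and observed locations."""
    errors: List[float] = [
        math.dist(project(p, R, t, focal, cx, cy), obs)
        for p, obs in zip(points, observed_pixels)
    ]
    return sum(errors) / len(errors)

if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    origin = [0.0, 0.0, 0.0]
    # The point (1, 2, 4) projects to (25, 50) at focal length 100.
    print(mean_reprojection_error([(1.0, 2.0, 4.0)], [(28.0, 54.0)],
                                  identity, origin, focal=100.0))
```

For generated views, the observed pixels would have to come from some correspondence estimate on the synthesized images, which is where most of the practical difficulty lies.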
-
Referee: [Abstract] Abstract: The text simultaneously highlights 'key architectural innovations' that enable the unified framework and states that video pre-training occurs 'without any architectural modifications.' This tension is load-bearing for the seamless-unification premise and requires clarification on the precise nature of any innovations versus the unmodified decoder-only rectified flow transformer.
Authors: We thank the referee for identifying this potential source of confusion. The phrase 'key architectural innovations' in the abstract refers to the high-level methodological contributions: (i) the sequence-to-sequence formulation that treats 3D view synthesis as a specialized video modeling task, (ii) the masked autoregressive generation mechanism enabling arbitrary numbers of 6-DoF target views from arbitrary reference views, and (iii) the unified training paradigm that allows a single model to handle both 3D and video data. These are innovations in task formulation, data representation, and training strategy. Critically, the underlying model remains an unmodified decoder-only rectified flow transformer; no architectural changes (such as custom 3D positional encodings, added geometric layers, or modifications to the attention or flow mechanisms) are introduced. Video pre-training is performed directly on this standard architecture. We will revise the abstract wording to explicitly distinguish between these methodological innovations and the absence of modifications to the transformer backbone itself. revision: yes
Circularity Check
No circularity: unification premise is an explicit modeling choice, performance claims rest on external benchmarks
The paper's central premise, that 3D can be treated as a specialized sub-domain of video and modeled as sequence-to-sequence image synthesis in an unmodified decoder-only rectified flow transformer, is presented as a design decision rather than a derived result. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or uniqueness result back to the inputs by construction. State-of-the-art claims are supported by external view-synthesis benchmarks, not by quantities defined internally in terms of the model's own outputs. This is the most common honest finding for an empirical scaling paper whose contributions are architectural and data-driven rather than derived mathematically.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (tag: unclear)
Relation between the paper passage and the cited Recognition theorem:
We represent different positions as follows: In 2D image positions, pixel coordinates are mapped to a pair of angles (θ_h, θ_w) … In 3D camera poses, 6-DoF camera extrinsics c = [R t; 0 1] … For video data: g := (θ_h, θ_w, θ_t) ∈ SO(2) × SO(2) × SO(2).
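To make the quoted passage concrete, the sketch below shows one way a pixel coordinate could be turned into a pair of angles and applied as SO(2) rotations, along with the 4×4 extrinsic matrix built from R and t. The passage is elided, so the linear pixel-to-angle mapping and the helper names (`pixel_to_angles`, `rotate`, `extrinsic`) are assumptions for illustration, not the paper's definitions. The property demonstrated is the usual reason for rotation-based encodings: the inner product of two rotated feature pairs depends only on the difference of their angles.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]

def pixel_to_angles(row: int, col: int, height: int, width: int) -> Pair:
    """Hypothetical linear map from pixel coordinates to angles in [0, 2*pi)."""
    return (2.0 * math.pi * row / height, 2.0 * math.pi * col / width)

def rotate(pair: Pair, theta: float) -> Pair:
    """Apply the SO(2) rotation by angle theta to a 2D feature pair."""
    x, y = pair
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def dot(a: Pair, b: Pair) -> float:
    return a[0] * b[0] + a[1] * b[1]

def extrinsic(R: Sequence[Sequence[float]], t: Sequence[float]) -> List[List[float]]:
    """Assemble the 4x4 matrix [R t; 0 1] from a 3x3 rotation and a translation."""
    rows = [list(R[i]) + [t[i]] for i in range(3)]
    rows.append([0.0, 0.0, 0.0, 1.0])
    return rows

if __name__ == "__main__":
    q, k = (1.0, 2.0), (0.5, -1.0)
    theta_q, _ = pixel_to_angles(3, 0, height=16, width=16)
    theta_k, _ = pixel_to_angles(7, 0, height=16, width=16)
    shift = 0.7  # shifting both positions by the same angle leaves the score unchanged
    print(dot(rotate(q, theta_q), rotate(k, theta_k)))
    print(dot(rotate(q, theta_q + shift), rotate(k, theta_k + shift)))
```

A video position g = (θ_h, θ_w, θ_t) would apply three such rotations to three separate slices of the feature vector, one per axis.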
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem:
Kaleido is a latent rectified flow model … Z_T^t = (1−t) z_T + t ϵ … trained with a standard noise-prediction objective
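The interpolation in this passage is simple enough to write out. The sketch below assumes z_T is a clean target latent, ϵ is Gaussian noise of the same shape, and t lies in [0, 1]; it computes the noised latent and a mean-squared-error loss against the noise, following the objective named in the passage. The function names and the trivial zero predictor are hypothetical, and the paper's actual loss weighting and time sampling are not given in the quoted text.

```python
import random
from typing import List

def noise_latent(z_clean: List[float], eps: List[float], t: float) -> List[float]:
    """Linear rectified-flow path: (1 - t) * z_clean + t * eps."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * z + t * e for z, e in zip(z_clean, eps)]

def noise_prediction_loss(predicted_eps: List[float], eps: List[float]) -> float:
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, eps)) / len(eps)

if __name__ == "__main__":
    rng = random.Random(0)
    z_clean = [4.0, -2.0, 0.5]                     # hypothetical target latent
    eps = [rng.gauss(0.0, 1.0) for _ in z_clean]   # Gaussian noise sample
    t = rng.random()                               # uniform time, an assumption
    z_t = noise_latent(z_clean, eps, t)
    predicted = [0.0 for _ in z_t]                 # placeholder for a model output
    print(z_t, noise_prediction_loss(predicted, eps))
```

At t = 0 the path returns the clean latent and at t = 1 it returns pure noise, which is why sampling runs from t = 1 back to t = 0.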
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Training schedule, shown as batch size / number of steps per stage (source caption truncated: "Table 5 Kalei..."):
Model | Stage 1 (video data, 256×256) | Stage 2 (3D data, 256×256) | Stage 3 (3D data, 512×512) | Stage 4 (3D data, 1024 mixed AR)
Kaleido-Small | 1024 / 700K | 1024 / 300K | 256 / 100K | 256 / 100K
Kaleido-Medium | 1024 / 700K | 1024 / 300K | 256 / 100K | 256 / 100K
Kaleido | 2048 / 700K | 2048 / 500K | 256 / 100K | 256 / 100K
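A quick piece of arithmetic on the training schedule helps gauge how much of training is video pre-training. The sketch below assumes each listed pair is (batch size, number of steps), that "K" means thousand, and that samples seen equals batch size times steps. These are readings of the flattened table, not figures stated in the paper.

```python
from typing import Dict, List, Tuple

Stage = Tuple[int, int]  # (batch size, number of steps)

# Stage order: video 256x256, 3D 256x256, 3D 512x512, 3D 1024 mixed AR.
SCHEDULE: Dict[str, List[Stage]] = {
    "Kaleido-Small": [(1024, 700_000), (1024, 300_000), (256, 100_000), (256, 100_000)],
    "Kaleido-Medium": [(1024, 700_000), (1024, 300_000), (256, 100_000), (256, 100_000)],
    "Kaleido": [(2048, 700_000), (2048, 500_000), (256, 100_000), (256, 100_000)],
}

def samples_seen(stages: List[Stage]) -> int:
    """Total training samples, assuming samples = batch size * steps per stage."""
    return sum(batch * steps for batch, steps in stages)

def video_fraction(stages: List[Stage]) -> float:
    """Share of all samples that come from the first (video) stage."""
    batch, steps = stages[0]
    return batch * steps / samples_seen(stages)

if __name__ == "__main__":
    for name, stages in SCHEDULE.items():
        print(f"{name}: {samples_seen(stages):,} samples, "
              f"{video_fraction(stages):.1%} from video")
```

Under these assumptions, video data accounts for roughly two thirds of the samples for the two smaller models and about four sevenths for the largest one.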