arxiv: 2510.04236 · v3 · submitted 2025-10-05 · 💻 cs.CV

Scaling Sequence-to-Sequence Generative Neural Rendering

Shikun Liu, Kam Woh Ng, Wonbong Jang, Jiadong Guo, Junlin Han, Haozhe Liu, Yiannis Douratsos, Juan C. Pérez, Zijian Zhou, Chi Phung, Tao Xiang, Juan-Manuel Pérez-Rúa

Pith reviewed 2026-05-18 10:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords: neural rendering, view synthesis, sequence-to-sequence, generative models, rectified flow, transformer, video pretraining, 3D reconstruction

The pith

A single sequence-to-sequence transformer pretrained on video data matches per-scene 3D optimization quality for view synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that three-dimensional neural rendering can be treated as a specialized form of video generation expressed purely as sequence-to-sequence image synthesis. This unification lets a decoder-only rectified flow transformer handle generative view synthesis for objects and scenes without explicit 3D representations or any architectural changes. Pretraining on large-scale video data then improves spatial consistency and reduces the need for scarce camera-labeled 3D datasets. If the approach holds, one model could generate consistent novel views from any number of reference images in both few-view and many-view regimes while outperforming other generative methods and rivaling specialized optimization techniques.

Core claim

Kaleido frames 3D as a sub-domain of video and performs photorealistic neural rendering through masked autoregressive sequence-to-sequence synthesis inside a decoder-only rectified flow transformer. The model generates any number of 6-DoF target views conditioned on any number of reference views, unifies 3D and video modeling in a single architecture, and uses video pretraining to achieve new state-of-the-art results on view synthesis benchmarks, with zero-shot performance that substantially exceeds other generative methods in few-view settings and matches per-scene optimization methods in many-view settings.

What carries the argument

A decoder-only rectified flow transformer that performs masked autoregressive sequence-to-sequence image synthesis, enabling unified generative view synthesis without explicit 3D representations.

Load-bearing premise

That the geometric structure and consistency of 3D scenes can be fully captured by treating them as ordinary image sequences in a video-style generative model without dedicated 3D components.

What would settle it

A controlled experiment on a standard view synthesis benchmark showing persistent geometric inconsistencies or lower quality than per-scene optimization methods even after video pretraining and with many input views available.

Original abstract

We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systemic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets -- all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Kaleido shows a video-pretrained decoder-only rectified flow transformer can do flexible masked autoregressive view synthesis and claims parity with per-scene methods, but geometric consistency without explicit 3D structure is the part that needs checking.

Kaleido shows that a single decoder-only rectified flow transformer pre-trained on video can perform generative view synthesis for arbitrary numbers of input and output views via masked autoregression, and it claims to match per-scene optimization quality in many-view cases for the first time. The new element is the systemic scaling study of this sequence-to-sequence formulation for neural rendering. By framing 3D as a specialized video task, the authors leverage abundant video data to improve consistency while keeping the architecture the same. This lets the model handle both few-view and many-view scenarios in one framework, which is a useful practical step. The benchmark results indicate it beats other generative baselines in zero-shot few-view settings and reaches comparable quality to methods like NeRF when views are plentiful.

The weaker link is whether this really delivers the geometric consistency needed without explicit 3D modeling. Video pre-training provides good priors but lacks calibrated 3D information, and generating views one by one might not enforce multi-view constraints as tightly as dedicated optimization methods. It would help to see more detailed analysis of cross-view errors beyond standard 2D metrics.

This work is for computer vision researchers interested in generative approaches to rendering and 3D reconstruction. It offers value to those exploring unified models that cut down on specialized 3D data requirements. It deserves peer review. The results are concrete and the unification idea is worth a closer look from experts in the area.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Kaleido, a family of generative models for photorealistic unified object- and scene-level neural rendering. It treats 3D as a specialized sub-domain of video expressed purely as a sequence-to-sequence image synthesis task, implemented via a decoder-only rectified flow transformer with a masked autoregressive framework. The model performs generative view synthesis without explicit 3D representations, generates any number of 6-DoF target views from any number of reference views, and unifies 3D and video modeling. It leverages large-scale video pre-training to improve spatial consistency without architectural modifications and claims new state-of-the-art results on view synthesis benchmarks, with zero-shot performance outperforming other generative methods in few-view settings and, for the first time, matching per-scene optimization methods in many-view settings.

Significance. If the central claims hold, the work would be significant for demonstrating that a single unmodified decoder-only rectified flow transformer pretrained on video data can achieve high-quality neural rendering across 3D and video domains. This could reduce reliance on scarce camera-labeled 3D datasets, simplify architectures by avoiding explicit 3D representations, and advance unified generative modeling. The scaling study and reported parity with per-scene methods like NeRF/3DGS in many-view regimes would represent a notable contribution if supported by rigorous controls for geometric consistency.

major comments (2)

[Abstract] The claim that Kaleido 'for the first time, matches the quality of per-scene optimisation methods in many-view settings' is central to the contribution. This depends on the video-pretrained sequence model implicitly capturing accurate multi-view geometry and cross-view consistency without explicit 3D structure or architectural changes. Standard 2D metrics (PSNR/SSIM) can mask inconsistencies; the manuscript should provide explicit multi-view consistency analysis or 3D-aware metrics to substantiate parity with NeRF/3DGS.
[Abstract] The text simultaneously highlights 'key architectural innovations' that enable the unified framework and states that video pre-training occurs 'without any architectural modifications.' This tension is load-bearing for the seamless-unification premise and requires clarification on the precise nature of any innovations versus the unmodified decoder-only rectified flow transformer.
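
To make the first major comment concrete, here is a minimal sketch of PSNR under its usual definition for images with intensities in [0, 1]. The function name `psnr` and the toy pixel lists are illustrative assumptions and are not taken from the manuscript. The score is computed one image at a time, so nothing in it compares one generated view against another.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    given as flat lists of pixel intensities in [0, max_val]."""
    if len(pred) != len(target):
        raise ValueError("images must have the same number of pixels")
    mse = sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# Hypothetical toy data: ground truth for two target views of one scene.
gt_views = [[0.2, 0.4, 0.6, 0.8], [0.1, 0.3, 0.5, 0.7]]
# Each prediction is off by a constant 0.1, but in opposite directions.
pred_views = [
    [v + 0.1 for v in gt_views[0]],
    [v - 0.1 for v in gt_views[1]],
]

# PSNR is evaluated independently per view.
per_view = [psnr(p, g) for p, g in zip(pred_views, gt_views)]
```

Both views score about 20 dB even though one is brightened and the other darkened. The per-view number does not register that the two predictions disagree with each other, which is the gap the referee asks the authors to close.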

minor comments (2)

[Abstract] Define 'few-view' and 'many-view' settings quantitatively (e.g., number of reference views) and list the specific benchmarks and baselines used to support the SOTA and parity claims.
[Methods] The masked autoregressive generation procedure for arbitrary numbers of 6-DoF views would benefit from a dedicated methods subsection with pseudocode or a clear conditioning diagram.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation of our claims without altering the core technical contributions.

Referee: [Abstract] The claim that Kaleido 'for the first time, matches the quality of per-scene optimisation methods in many-view settings' is central to the contribution. This depends on the video-pretrained sequence model implicitly capturing accurate multi-view geometry and cross-view consistency without explicit 3D structure or architectural changes. Standard 2D metrics (PSNR/SSIM) can mask inconsistencies; the manuscript should provide explicit multi-view consistency analysis or 3D-aware metrics to substantiate parity with NeRF/3DGS.

Authors: We appreciate the referee's emphasis on rigorous validation of geometric consistency. While PSNR and SSIM remain the community-standard metrics for direct comparison against per-scene optimization baselines such as NeRF and 3DGS (as used in prior generative view synthesis works), we acknowledge that these image-level metrics alone can obscure certain multi-view inconsistencies. Our manuscript already includes extensive qualitative visualizations across diverse scenes demonstrating cross-view coherence, and the scaling experiments show that video pre-training measurably improves spatial consistency. To further substantiate the parity claim, we will add a dedicated subsection in the Experiments section reporting additional 3D-aware analyses, including average cross-view optical flow consistency and reprojection error statistics computed on held-out views where camera poses permit. These additions will be included in the revised manuscript.

Revision: yes

Referee: [Abstract] The text simultaneously highlights 'key architectural innovations' that enable the unified framework and states that video pre-training occurs 'without any architectural modifications.' This tension is load-bearing for the seamless-unification premise and requires clarification on the precise nature of any innovations versus the unmodified decoder-only rectified flow transformer.

Authors: We thank the referee for identifying this potential source of confusion. The phrase 'key architectural innovations' in the abstract refers to the high-level methodological contributions: (i) the sequence-to-sequence formulation that treats 3D view synthesis as a specialized video modeling task, (ii) the masked autoregressive generation mechanism enabling arbitrary numbers of 6-DoF target views from arbitrary reference views, and (iii) the unified training paradigm that allows a single model to handle both 3D and video data. These are innovations in task formulation, data representation, and training strategy. Critically, the underlying model remains an unmodified decoder-only rectified flow transformer; no architectural changes (such as custom 3D positional encodings, added geometric layers, or modifications to the attention or flow mechanisms) are introduced. Video pre-training is performed directly on this standard architecture. We will revise the abstract wording to explicitly distinguish between these methodological innovations and the absence of modifications to the transformer backbone itself.

Revision: yes
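
The first response commits to reporting reprojection error statistics. As a point of reference, the sketch below computes a mean reprojection error with a standard pinhole camera model (no skew, no lens distortion). All names (`project`, `reprojection_error`) and the toy numbers are assumptions made for illustration. How 3D points and 2D observations would be obtained for a model without explicit 3D structure is not specified in the response and is left open here.

```python
import math

def project(K, R, t, X):
    """Project a 3D world point X to pixel coordinates with a pinhole camera.
    K: 3x3 intrinsics, R: 3x3 rotation, t: translation (world to camera)."""
    # Transform the point into the camera frame: Xc = R X + t.
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by focal length and principal point.
    u = K[0][0] * Xc[0] / Xc[2] + K[0][2]
    v = K[1][1] * Xc[1] / Xc[2] + K[1][2]
    return (u, v)

def reprojection_error(K, R, t, points_3d, observed_2d):
    """Mean Euclidean pixel distance between projected and observed points."""
    errors = []
    for X, (u_obs, v_obs) in zip(points_3d, observed_2d):
        u, v = project(K, R, t, X)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)

# Hypothetical toy setup: identity pose, focal length 100, principal point (50, 50).
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
points = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
# The first observation is exact; the second is off by (3, 4) pixels.
observed = [(50.0, 50.0), (103.0, 54.0)]

err = reprojection_error(K, R, t, points, observed)
```

In this toy case the two per-point errors are 0 and 5 pixels, so the mean is 2.5 pixels. Unlike PSNR, the quantity depends on the camera pose, which is what makes it sensitive to geometric inconsistency.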

Circularity Check

0 steps flagged

No circularity: unification premise is an explicit modeling choice, performance claims rest on external benchmarks

The paper's central premise—that 3D can be treated as a specialized sub-domain of video and modeled as sequence-to-sequence image synthesis in an unmodified decoder-only rectified flow transformer—is presented as a design decision rather than a derived result. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or uniqueness result back to the inputs by construction. State-of-the-art claims are supported by external view-synthesis benchmarks, not by quantities defined internally in terms of the model's own outputs. This is the most common honest finding for an empirical scaling paper whose contributions are architectural and data-driven rather than mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that video pre-training transfers directly to 3D view synthesis without architectural changes; no free parameters or invented entities are identifiable from the abstract.

axioms (1)

domain assumption: 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task
Explicitly stated as the operating principle of Kaleido in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

We represent different positions as follows: In 2D image positions, pixel coordinates are mapped to a pair of angles $(\theta_h, \theta_w)$ … In 3D camera poses, 6-DoF camera extrinsics $c = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$ … For video data: $g := (\theta_h, \theta_w, \theta_t) \in SO(2) \times SO(2) \times SO(2)$.
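
The objects named in the passage quoted above can be written down directly. The sketch below builds an SO(2) rotation from an angle and a 4×4 extrinsics matrix of the form [R t; 0 1], and checks two standard properties: angles add under composition, and a rigid transform has the closed-form inverse [Rᵀ, −Rᵀt; 0 1]. The helper names and numbers are assumptions for illustration. How Kaleido turns these elements into positional encodings is elided in the quote and is not reproduced here.

```python
import math

def so2(theta):
    """2x2 rotation matrix for angle theta (an element of SO(2))."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(A, B):
    """Plain-Python product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def extrinsics(R, t):
    """Assemble the 4x4 homogeneous matrix c = [R t; 0 1]."""
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def invert_extrinsics(c):
    """Closed-form inverse of a rigid transform: [R^T, -R^T t; 0 1]."""
    Rt = [[c[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [c[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return extrinsics(Rt, t_inv)

# Hypothetical pose: rotate 30 degrees about the z-axis, then translate.
r2 = so2(math.radians(30.0))
R = [[r2[0][0], r2[0][1], 0.0],
     [r2[1][0], r2[1][1], 0.0],
     [0.0, 0.0, 1.0]]
c = extrinsics(R, [0.5, -1.0, 2.0])

# c composed with its inverse should be the 4x4 identity (up to rounding).
identity = matmul(c, invert_extrinsics(c))
# Composing two SO(2) elements adds their angles.
composed = matmul(so2(0.2), so2(0.3))
```

The angle tuple for video in the quote is three independent SO(2) elements, one per axis, so each component composes by simple angle addition as in `composed`. The camera extrinsics do not commute in general, which is one reason they are carried as a full matrix.
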
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Kaleido is a latent rectified flow model … $z_T^t = (1-t)\,z_T + t\,\epsilon$ … trained with a standard noise-prediction objective
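
The interpolation in the passage quoted above is a straight line between a clean latent and a noise sample. The sketch below implements that line elementwise, drops the subscript for brevity, and shows that a correct noise estimate lets the clean latent be recovered for any t < 1. The function names, the toy vectors, and the use of mean squared error as the "standard noise-prediction objective" are assumptions made for illustration, not details confirmed by the quote.

```python
import random

def interpolate(z, eps, t):
    """Noisy latent z_t = (1 - t) * z + t * eps, applied elementwise."""
    return [(1.0 - t) * zi + t * ei for zi, ei in zip(z, eps)]

def recover_clean(z_t, eps_hat, t):
    """Invert the interpolation given a noise estimate (requires t < 1)."""
    if t >= 1.0:
        raise ValueError("the clean latent is not recoverable at t = 1")
    return [(zti - t * ei) / (1.0 - t) for zti, ei in zip(z_t, eps_hat)]

def noise_prediction_loss(eps_hat, eps):
    """Mean squared error between predicted and true noise (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps)) / len(eps)

# Hypothetical toy latent and noise sample.
rng = random.Random(0)
z = [rng.uniform(-1.0, 1.0) for _ in range(8)]
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]

t = 0.3
z_t = interpolate(z, eps, t)

# With a perfect noise estimate the loss is zero and z is recovered.
z_rec = recover_clean(z_t, eps, t)
loss = noise_prediction_loss(eps, eps)
```

At t = 0 the interpolant is the clean latent and at t = 1 it is pure noise, so t plays the role of a noise level along a straight path between the two.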

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
