TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

Weiliang Chen; Xuebo Wang; Yuanhui Huang; Yueqi Duan

arxiv: 2606.17590 · v1 · pith:DBPEFAZPnew · submitted 2026-06-16 · 💻 cs.CV

Pith reviewed 2026-06-27 01:17 UTC · model grok-4.3

Classification: cs.CV

Keywords: video tokenization, time-invariant tokens, time-variant tokens, Scope-Induced Factorization, Invariant Broadcasting, video compression, scalable video generation

The pith

TivTok encodes persistent video content once in reusable time-invariant tokens and frame-specific changes in variant tokens to cut total token usage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TivTok as a tokenizer that splits each video clip into time-invariant tokens capturing content shared across all frames and time-variant tokens holding only the residuals unique to each frame. Scope-Induced Factorization enforces this split by giving invariant tokens attention over the entire clip while restricting variant tokens to their own frame plus the invariants. Invariant Broadcasting then reuses the same invariant tokens for every frame and across chunks during decoding. This reuse directly lowers the token count needed for reconstruction, which the authors show improves efficiency especially on longer sequences that contain static elements.

Core claim

TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. Scope-Induced Factorization assigns different attention scopes to the two groups so TIV tokens attend to the full clip while each TV token accesses only its frame together with the TIV tokens. In the decoder, Invariant Broadcasting reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization.
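The attention scopes described above can be sketched as a boolean mask. This is a minimal illustration and not the authors' implementation: the sequence layout (TIV tokens first, then each frame's patch tokens followed by its TV tokens), the function name `sif_attention_mask`, and the choice to keep patch queries frame-local are all assumptions made for the example.

```python
def sif_attention_mask(num_frames, patches_per_frame, tv_per_frame, num_tiv):
    """Return (mask, layout); mask[q][k] is True if query q may attend to key k.

    Assumed layout: [TIV tokens | frame 0 patches + TV | frame 1 patches + TV | ...].
    """
    # Label every position with (kind, frame index); TIV tokens belong to no frame.
    layout = [("tiv", None)] * num_tiv
    for t in range(num_frames):
        layout += [("patch", t)] * patches_per_frame
        layout += [("tv", t)] * tv_per_frame

    n = len(layout)
    mask = [[False] * n for _ in range(n)]
    for q, (q_kind, q_frame) in enumerate(layout):
        for k, (k_kind, k_frame) in enumerate(layout):
            if q_kind == "tiv":
                # TIV queries have full-clip scope.
                mask[q][k] = True
            elif q_kind == "tv":
                # TV queries see the TIV tokens plus their own frame only.
                mask[q][k] = (k_kind == "tiv") or (k_frame == q_frame)
            else:
                # Patch queries: frame-local scope (an assumption of this sketch).
                mask[q][k] = k_frame == q_frame
    return mask, layout


if __name__ == "__main__":
    mask, layout = sif_attention_mask(num_frames=2, patches_per_frame=2,
                                      tv_per_frame=1, num_tiv=1)
    for (kind, frame), row in zip(layout, mask):
        print(f"{kind:5} {str(frame):4}", "".join("1" if v else "0" for v in row))
```

Because a TV token can never read another frame, anything it needs that is shared across frames has to come from the TIV tokens, which is the pressure the factorization relies on.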

What carries the argument

Scope-Induced Factorization (SIF) that enforces full-clip attention for TIV tokens versus frame-limited attention for TV tokens, together with Invariant Broadcasting (IB) that reuses TIV tokens across time.
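The broadcasting half of the mechanism can be illustrated with plain lists. This is a sketch under stated assumptions, not the paper's decoder: tokens are stand-in strings, the helper names `broadcast_invariants` and `stored_token_count` are hypothetical, and placing the TIV tokens in front of each frame's TV tokens is an arbitrary choice for the example.

```python
def broadcast_invariants(tiv_tokens, tv_tokens_per_frame):
    """Build one decoder input per frame: the shared TIV tokens plus that frame's TV tokens."""
    # Each frame gets the same TIV tokens, so frames can be decoded independently.
    return [list(tiv_tokens) + list(tv) for tv in tv_tokens_per_frame]


def stored_token_count(tiv_tokens, tv_tokens_per_frame):
    """Tokens that must be stored: TIV tokens once, TV tokens once per frame."""
    return len(tiv_tokens) + sum(len(tv) for tv in tv_tokens_per_frame)


if __name__ == "__main__":
    tiv = ["background", "object_appearance"]   # placeholder token values
    tv = [["delta_0"], ["delta_1"], ["delta_2"]]
    for frame_input in broadcast_invariants(tiv, tv):
        print(frame_input)
    print("stored tokens:", stored_token_count(tiv, tv))
```

The decoder sees the TIV tokens in every frame's input, but they are stored only once, which is where the token saving comes from.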

If this is right

Achieves an rFVD of 12.65 on the 16×256×256 benchmark
Delivers 2.91× higher compression efficiency on 128-frame videos versus evaluated baselines
Uses only 1.1% of the tokens required by downsample-based tokenizers
Enables parallel reconstruction and tokenization of long videos through cross-chunk reuse of invariant tokens
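To see why cross-chunk reuse matters more as videos get longer, the token budget can be worked through with made-up numbers. The counts below (64 shared tokens, 4 variant tokens per frame, 16-frame chunks) are illustrative assumptions only; they are not taken from the paper and do not reproduce its 2.91× or 1.1% figures. The function names are hypothetical.

```python
def tokens_without_reuse(num_chunks, frames_per_chunk, shared_per_chunk, variant_per_frame):
    """Every chunk re-encodes its own shared tokens."""
    return num_chunks * (shared_per_chunk + frames_per_chunk * variant_per_frame)


def tokens_with_reuse(num_chunks, frames_per_chunk, shared_once, variant_per_frame):
    """Shared tokens are stored once and reused by every chunk."""
    return shared_once + num_chunks * frames_per_chunk * variant_per_frame


if __name__ == "__main__":
    # 8 chunks of 16 frames = 128 frames (illustrative sizes, not the paper's).
    for chunks in (1, 2, 4, 8):
        no_reuse = tokens_without_reuse(chunks, 16, 64, 4)
        reuse = tokens_with_reuse(chunks, 16, 64, 4)
        print(chunks * 16, "frames:", no_reuse, "vs", reuse)
```

With these assumed sizes the two budgets are equal for a single chunk and diverge as chunks are added, because the shared cost is paid once instead of once per chunk.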

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factorization principle could be tested on other sequential data such as audio waveforms or motion capture where some features remain stable over time.
Generation models built on top of TivTok tokens might allocate separate capacity to modeling invariants versus variants, potentially improving coherence over very long outputs.
If the reuse ratio holds at larger scales, training and inference costs for video models could drop roughly in proportion to the reported token reduction.

Load-bearing premise

Persistent content such as static backgrounds and consistent object appearances can be effectively encoded in time-invariant tokens reusable across frames and chunks without substantial reconstruction loss.

What would settle it

Reconstruction quality measured on videos whose backgrounds or object appearances change substantially between frames; if error rises markedly above that of non-reuse baselines, the factorization benefit disappears.
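One way such a check could be organized is to bucket clips by how much they change over time and compare the error gap per bucket. The sketch below is hypothetical: the variation score (mean absolute difference between consecutive frames), the threshold, and the helper names are assumptions, and the error values are expected to come from whatever reconstruction metric the evaluation uses. Clips need at least two frames.

```python
def temporal_variation(frames):
    """Mean absolute difference between consecutive frames.

    frames: list of equal-length lists of pixel values (at least two frames).
    """
    per_step = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(per_step) / len(per_step)


def mean_error_gap(clips, threshold):
    """clips: list of (frames, err_reuse, err_baseline).

    Returns the mean of (err_reuse - err_baseline) for static and dynamic clips.
    """
    buckets = {"static": [], "dynamic": []}
    for frames, err_reuse, err_baseline in clips:
        key = "dynamic" if temporal_variation(frames) > threshold else "static"
        buckets[key].append(err_reuse - err_baseline)
    # None marks an empty bucket.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

A markedly positive gap in the dynamic bucket alongside a near-zero gap in the static bucket would be the signature of the failure case described above.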

Original abstract

Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard 16×256×256 benchmark and improves compression efficiency by 2.91× for 128-frame videos compared with the evaluated baselines, while using only 1.1% of the tokens required by downsample-based tokenizers in our evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TivTok's TIV/TV split with scope-induced attention and broadcasting cuts token counts for long videos in a clean architectural way, but the gains rest on unexamined experimental details.

The main thing to know is that TivTok factors each clip into time-invariant tokens, which are reused across frames via broadcasting, and time-variant residuals. This directly targets the repeated encoding of static content, and the paper reports strong efficiency numbers on standard benchmarks.

The paper does a solid job laying out the problem of token explosion in video models and proposing Scope-Induced Factorization to enforce the split through attention scopes: TIV tokens see the full clip, while each TV token sees only its own frame plus the TIV tokens. Invariant Broadcasting then handles decoder-side reuse and chunking. This is a genuine architectural move rather than another compression heuristic, and the claimed results (an rFVD of 12.65 on 16×256×256, a 2.91× efficiency gain on 128-frame videos, and 1.1% of the token count of downsample-based tokenizers) are concrete enough to be worth checking.

The soft spots sit in the evaluation. The abstract states the quantitative wins but gives no training procedure, baseline reimplementation details, or breakdown on dynamic scenes where the invariant assumption could cost reconstruction quality. Without those, it is hard to know how much the gains depend on dataset specifics or whether the separation holds cleanly in practice. The central assumption—that persistent backgrounds and appearances can be isolated into reusable TIV tokens without substantial loss—is reasonable but untested in the provided material.

This is for people working on video tokenizers and long-sequence generation who need lower token budgets. A reader focused on attention scoping or reuse tricks would get usable ideas from the method even if they adapt it.

It deserves a serious referee because the core construction is distinct from prior work and the efficiency claims are specific enough to repay detailed checking.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes TivTok, a reuse-aware video tokenizer that factorizes each clip into Time-Invariant (TIV) tokens capturing persistent content across frames and Time-Variant (TV) tokens capturing frame-specific residuals. Scope-Induced Factorization (SIF) enforces this split by giving TIV tokens full-clip attention scope while restricting each TV token to its own frame plus the TIV tokens; Invariant Broadcasting (IB) then reuses the same TIV tokens across frames and chunks during decoding. On the standard 16×256×256 benchmark the method reports an rFVD of 12.65, a 2.91× compression-efficiency gain for 128-frame videos relative to evaluated baselines, and a token count equal to only 1.1 % of that required by downsample-based tokenizers.

Significance. If the reported metrics are reproducible, the work would meaningfully advance scalable video generation by directly attacking the redundancy of repeatedly tokenizing static backgrounds and consistent object appearances. The SIF/IB construction offers a clean architectural mechanism for token reuse that could extend to longer videos without proportional growth in token budget.

major comments (1)

Abstract: the central quantitative claims (rFVD = 12.65, 2.91× efficiency, 1.1 % token count) are presented without any description of training procedure, baseline implementations, dataset splits, or error analysis on dynamic content; these omissions are load-bearing for the scalability assertions that rest on the SIF/IB factorization working as described.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback. We address the concern about the abstract below, noting that abstracts are necessarily concise while the full experimental details appear in the manuscript body.

Referee: [—] Abstract: the central quantitative claims (rFVD = 12.65, 2.91× efficiency, 1.1 % token count) are presented without any description of training procedure, baseline implementations, dataset splits, or error analysis on dynamic content; these omissions are load-bearing for the scalability assertions that rest on the SIF/IB factorization working as described.

Authors: We agree that the abstract is brief by design and omits procedural details. The manuscript provides these in full: training procedure and hyperparameters are described in Section 4.1, baseline implementations and tokenization comparisons in Section 4.2, dataset splits and benchmark construction in Section 4.3, and error analysis on dynamic versus static content (including per-category rFVD breakdowns) in Section 5.3 and the supplementary material. The reported rFVD of 12.65 follows the standard 16×256×256 protocol used by prior work; the 2.91× efficiency gain and 1.1 % token count are computed directly from the token budgets and reconstruction metrics on 128-frame clips. We believe these sections substantiate the scalability claims for the SIF/IB mechanism.

Revision: no

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents TivTok as an architectural innovation using Scope-Induced Factorization (SIF) to separate time-invariant and time-variant tokens, followed by Invariant Broadcasting (IB) for reuse. Reported metrics (rFVD 12.65, 2.91× efficiency, 1.1% token count) are empirical outcomes on standard benchmarks rather than derivations or predictions that reduce to fitted inputs or self-definitions by construction. No equations, self-citation chains, or uniqueness theorems are exhibited in the supplied material that would force the central claims to be equivalent to their own inputs. The method is self-contained as a proposed tokenizer design evaluated experimentally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Abstract-only review limits visibility into parameters or assumptions; core idea rests on domain assumption of separable persistent content.

axioms (1)

Domain assumption: Videos contain substantial persistent content that can be separated into time-invariant and time-variant components without major loss.
Foundational premise for SIF and IB to work as described.

invented entities (2)

Time-Invariant (TIV) tokens (no independent evidence)
purpose: Encode information shared across frames for reuse
New token category introduced by the method

Time-Variant (TV) tokens (no independent evidence)
purpose: Encode frame-specific residuals
New token category introduced by the method

pith-pipeline@v0.9.1-grok · 5801 in / 1224 out tokens · 44893 ms · 2026-06-27T01:17:46.181804+00:00 · methodology

Reference graph

Works this paper leans on

129 extracted references · 5 canonical work pages

[1]

(2025) Cosmos world foundation model platform for physical ai

Agarwal N, Ali A, Bala M, Balaji Y, Barker E, Cai T, Chattopadhyay P, Chen Y, Cui Y, Ding Y, et al. (2025) Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:250103575

2025
[2]

In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1728--1738

Bain M, Nagrani A, Varol G, Zisserman A (2021) Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1728--1738

2021
[3]

(2023 a ) Stable video diffusion: Scaling latent video diffusion models to large datasets

Blattmann A, Dockhorn T, Kulal S, Mendelevitch D, Kilian M, Lorenz D, Levi Y, English Z, Voleti V, Letts A, et al. (2023 a ) Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:231115127

2023
[4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22563--22575

Blattmann A, Rombach R, Ling H, Dockhorn T, Kim SW, Fidler S, Kreis K (2023 b ) Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22563--22575

2023
[5]

arXiv preprint arXiv:180801340

Carreira J, Noland E, Banki-Horvath A, Hillier C, Zisserman A (2018) A short note about kinetics-600. arXiv preprint arXiv:180801340

2018
[6]

In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 28358--28370

Chen H, Wang Z, Li X, Sun X, Chen F, Liu J, Wang J, Raj B, Liu Z, Barsoum E (2025 a ) Softvq-vae: Efficient 1-dimensional continuous tokenizer. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 28358--28370

2025
[7]

arXiv preprint arXiv:240901199

Chen L, Li Z, Lin B, Zhu B, Wang Q, Yuan S, Zhou X, Cheng X, Yuan L (2024 a ) Od-vae: An omni-dimensional video compressor for improving latent video diffusion model. arXiv preprint arXiv:240901199

2024
[8]

arXiv preprint arXiv:240812601

Chen W, Liu F, Wu D, Sun H, Lu J, Duan Y (2024 b ) Dreamcinema: Cinematic transfer with free camera and 3d character. arXiv preprint arXiv:240812601

2024
[9]

arXiv preprint arXiv:250610981

Chen W, Bi J, Huang Y, Zheng W, Duan Y (2025 b ) Scenecompleter: Dense 3d scene completion for generative novel view synthesis. arXiv preprint arXiv:250610981

2025
[10]

The Alliance for Open Media 1:2

De Rivaz P, Haughton J (2019) Av1 bitstream & decoding process specification. The Alliance for Open Media 1:2

2019
[11]

Communications of the ACM 63(11):139--144

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2020) Generative adversarial networks. Communications of the ACM 63(11):139--144

2020
[12]

(2024) Ltx-video: Realtime video latent diffusion

HaCohen Y, Chiprut N, Brazowski B, Shalem D, Moshe D, Richardson E, Levin E, Shiran G, Zabari N, Gordon O, et al. (2024) Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:250100103

2024
[13]

arXiv preprint arXiv:250618899

Huang K, Huang Y, Wang X, Lin Z, Ning X, Wan P, Zhang D, Wang Y, Liu X (2025 a ) Filmaster: Bridging cinematic principles and generative ai for automated film generation. arXiv preprint arXiv:250618899

2025
[14]

arXiv preprint arXiv:241209600

Huang Y, Zheng W, Gao Y, Tao X, Wan P, Zhang D, Zhou J, Lu J (2024) Owl-1: Omni world model for consistent long video generation. arXiv preprint arXiv:241209600

2024
[15]

arXiv preprint arXiv:250610962

Huang Y, Chen W, Zheng W, Duan Y, Zhou J, Lu J (2025 b ) Spectralar: Spectral autoregressive visual generation. arXiv preprint arXiv:250610962

2025
[16]

In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22853--22863

Jang H, Yu S, Shin J, Abbeel P, Seo Y (2025) Efficient long video tokenization via coordinate-based patch reconstruction. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22853--22863

2025
[17]

(2024) Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization

Jin Y, Sun Z, Xu K, Chen L, Jiang H, Huang Q, Song C, Liu Y, Zhang D, Song Y, et al. (2024) Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv preprint arXiv:240203161

2024
[18]

In: European conference on computer vision, Springer, pp 694--711

Johnson J, Alahi A, Fei-Fei L (2016) Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision, Springer, pp 694--711

2016
[19]

In: European Conference on Computer Vision, Springer, pp 148--165

Kim K, Lee H, Park J, Kim S, Lee K, Kim S, Yoo J (2024) Hybrid video diffusion models with 2d triplane and 3d wavelet representation. In: European Conference on Computer Vision, Springer, pp 148--165

2024
[20]

In: International conference on machine learning, PMLR, pp 1558--1566

Larsen ABL, S nderby SK, Larochelle H, Winther O (2016) Autoencoding beyond pixels using a learned similarity metric. In: International conference on machine learning, PMLR, pp 1558--1566

2016
[21]

arXiv preprint arXiv:250517011

Li Y, Tian C, Xia R, Liao N, Guo W, Yan J, Li H, Dai J, Li H, Yang X (2025) Learning adaptive and temporally causal video tokenization in a 1d latent space. arXiv preprint arXiv:250517011

2025
[22]

(2024) Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding

Li Z, Zhang J, Lin Q, Xiong J, Long Y, Deng X, Zhang Y, Liu X, Huang M, Xiao Z, et al. (2024) Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:240508748

2024
[23]

In: European Conference on Computer Vision, Springer, pp 389--406

Liu F, Wang H, Chen W, Sun H, Duan Y (2024) Make-your-3d: Fast and consistent subject-driven 3d content generation. In: European Conference on Computer Vision, Springer, pp 389--406

2024
[24]

arXiv preprint arXiv:250607136

Liu H, Sun W, Zhang Q, Di D, Gong B, Li H, Wei C, Zou C (2025) Hi-vae: Efficient video autoencoding with global and detailed motion. arXiv preprint arXiv:250607136

2025
[25]

In: European Conference on Computer Vision, Springer, pp 23--40

Ma N, Goldstein M, Albergo MS, Boffi NM, Vanden-Eijnden E, Xie S (2024) Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision, Springer, pp 23--40

2024
[26]

arXiv preprint arXiv:230701952

Podell D, English Z, Lacey K, Blattmann A, Dockhorn T, M \"u ller J, Penna J, Rombach R (2023) Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:230701952

2023
[27]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4209--4219

Ren X, Huang J, Zeng X, Museth K, Fidler S, Williams F (2024 a ) Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4209--4219

2024
[28]

Advances in Neural Information Processing Systems 37:97670--97698

Ren X, Lu Y, Liang H, Wu Z, Ling H, Chen M, Fidler S, Williams F, Huang J (2024 b ) Scube: Instant large-scale scene reconstruction using voxsplats. Advances in Neural Information Processing Systems 37:97670--97698

2024
[29]

264 and MPEG-4 video compression: video coding for next-generation multimedia

Richardson IE (2004) H. 264 and MPEG-4 video compression: video coding for next-generation multimedia. John Wiley & Sons

2004
[30]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10684--10695

Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10684--10695

2022
[31]

International Journal of Computer Vision 134(6), doi:10.1007/s11263-026-02901-4

Shu Y, Qiu Z, Yao T, Mei T (2026) Guidedvdm: Controllable video generation with long-term consistency. International Journal of Computer Vision 134(6), doi:10.1007/s11263-026-02901-4

work page doi:10.1007/s11263-026-02901-4 2026
[32]

arXiv preprint arXiv:12120402

Soomro K, Zamir AR, Shah M (2012) Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:12120402

2012
[33]

arXiv preprint arXiv:241210443

Tan Z, Xue B, Jia J, Wang J, Ye W, Shi S, Sun M, Wu W, Chen Q, Jiang P (2024) Sweettok: Semantic-aware spatial-temporal tokenizer for compact video discretization. arXiv preprint arXiv:241210443

2024
[34]

arXiv preprint arXiv:241213061

Tang A, He T, Guo J, Cheng X, Song L, Bian J (2024) Vidtok: A versatile and open-source video tokenizer. arXiv preprint arXiv:241213061

2024
[35]

Advances in neural information processing systems 37:84839--84865

Tian K, Jiang Y, Yuan Z, Peng B, Wang L (2024 a ) Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems 37:84839--84865

2024
[36]

arXiv preprint arXiv:241113552

Tian R, Dai Q, Bao J, Qiu K, Yang Y, Luo C, Wu Z, Jiang YG (2024 b ) Reducio! generating 1k video within 16 seconds using extremely compressed motion latents. arXiv preprint arXiv:241113552

2024
[37]

arXiv preprint arXiv:181201717

Unterthiner T, Van Steenkiste S, Kurach K, Marinier R, Michalski M, Gelly S (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717

2018
[38]

arXiv preprint arXiv:241021264

Wang H, Suri S, Ren Y, Chen H, Shrivastava A (2024 a ) Larp: Tokenizing videos with a learned autoregressive generative prior. arXiv preprint arXiv:241021264

2024
[39]

Advances in Neural Information Processing Systems 37:28281--28295

Wang J, Jiang Y, Yuan Z, Peng B, Wu Z, Jiang YG (2024 b ) Omnitokenizer: A joint image-video tokenizer for visual generation. Advances in Neural Information Processing Systems 37:28281--28295

2024
[40]

International Journal of Computer Vision 134(1), doi:10.1007/s11263-025-02604-2

Wang S, Shen L, Xiao J, Tian Z, Wang F, Hu X, Zhu Y, Feng G (2026) Breaking redundancy via 3d sparse geometry: 3d-aware neural compression for multi-view videos. International Journal of Computer Vision 134(1), doi:10.1007/s11263-025-02604-2

work page doi:10.1007/s11263-025-02604-2 2026
[41]

Advances in Neural Information Processing Systems 37:65618--65642

Wang W, Yang Y (2024) Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems 37:65618--65642

2024
[42]

In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22922--22932

Wang Y, Guo J, Xie X, He T, Sun X, Bian J (2025) Vidtwin: Video vae with decoupled structure and dynamics. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22922--22932

2025
[43]

IEEE transactions on image processing 13(4):600--612

Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4):600--612

2004
[44]

Advances in Neural Information Processing Systems 37:68082--68119

Wu J, Yin S, Feng N, He X, Li D, Hao J, Long M (2024) ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems 37:68082--68119

2024
[45]

International Journal of Computer Vision 131(10):2699--2722, doi:10.1007/s11263-023-01832-8

Xu X, Wang Y, Wang L, Yu B, Jia J (2023) Conditional temporal variational autoencoder for action video prediction. International Journal of Computer Vision 131(10):2699--2722, doi:10.1007/s11263-023-01832-8

work page doi:10.1007/s11263-023-01832-8 2023
[46]

arXiv preprint arXiv:241008368

Yan W, Mnih V, Faust A, Zaharia M, Abbeel P, Liu H (2024) Elastictok: Adaptive tokenization for image and video. arXiv preprint arXiv:241008368

2024
[47]

International Journal of Computer Vision 129(5):1451--1466, doi:10.1007/s11263-020-01429-5

Yang C, Shen Y, Zhou B (2021) Semantic hierarchy emerges in deep generative representations for scene synthesis. International Journal of Computer Vision 129(5):1451--1466, doi:10.1007/s11263-020-01429-5

work page doi:10.1007/s11263-020-01429-5 2021
[48]

generation: Taming optimization dilemma in latent diffusion models

Yao J, Yang B, Wang X (2025) Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 15703--15712

2025
[49]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22888--22897

Yoo J, Kim S, Lee D, Kim C, Hong S (2023) Towards end-to-end generative modeling of long videos with memory-efficient bidirectional transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22888--22897

2023
[50]

Advances in Neural Information Processing Systems 37:128940--128966

Yu Q, Weber M, Deng X, Shen X, Cremers D, Chen LC (2024 a ) An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems 37:128940--128966

2024
[51]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 18456--18466

Yu S, Sohn K, Kim S, Shin J (2023) Video probabilistic diffusion models in projected latent space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 18456--18466

2023
[52]

arXiv preprint arXiv:240314148

Yu S, Nie W, Huang DA, Li B, Shin J, Anandkumar A (2024 b ) Efficient video diffusion models via content-frame motion-latent decomposition. arXiv preprint arXiv:240314148

2024
[53]

arXiv preprint arXiv:250212632

Yu S, Hahn M, Kondratyuk D, Shin J, Gupta A, Lezama J, Essa I, Ross D, Huang J (2025) Malt diffusion: Memory-augmented latent transformers for any-length video generation. arXiv preprint arXiv:250212632

2025
[54]

In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 586--595

Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018) The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 586--595

2018
[55]

International Journal of Computer Vision 133(7):4909--4922, doi:10.1007/s11263-025-02413-7

Zhang Y, Wang X, Chen H, Qin C, Hao Y, Mei H, Zhu W (2025) Scenariodiff: Text-to-video generation with dynamic transformations of scene conditions. International Journal of Computer Vision 133(7):4909--4922, doi:10.1007/s11263-025-02413-7

work page doi:10.1007/s11263-025-02413-7 2025
[56]

Advances in Neural Information Processing Systems 37:12847--12871

Zhao S, Zhang Y, Cun X, Yang S, Niu M, Li X, Hu W, Shan Y (2024) Cv-vae: A compatible video vae for latent generative video models. Advances in Neural Information Processing Systems 37:12847--12871

2024
[57]

In: European conference on computer vision, Springer, pp 55--72

Zheng W, Chen W, Huang Y, Zhang B, Duan Y, Lu J (2024 a ) Occworld: Learning a 3d occupancy world model for autonomous driving. In: European conference on computer vision, Springer, pp 55--72

2024
[58]

arXiv preprint arXiv:241220404

Zheng Z, Peng X, Yang T, Shen C, Li S, Liu H, Zhou Y, Li T, You Y (2024 b ) Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:241220404

2024
[59]

FirstName Alpher , title =
[60]

Journal of Foo , volume = 13, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
[61]

Journal of Foo , volume = 14, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
[62]

FirstName Alpher and FirstName Gamow , title =
[63]

Computer Vision -- ECCV 2022 , year =

2022
[64]

2025 , url=

Zhong, Tianxiong and Tian, Xingye and Jiang, Boyuan and Wang, Xuebo and Tao, Xin and Wan, Pengfei and Zhang, Zhiwei , booktitle=. 2025 , url=

2025
[65]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
[66]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
[67]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016
[68]

European Conference on Computer Vision , pages=

Make-your-3d: Fast and consistent subject-driven 3d content generation , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024
[69]

arXiv preprint arXiv:2412.09600 , year=

Owl-1: Omni world model for consistent long video generation , author=. arXiv preprint arXiv:2412.09600 , year=

arXiv
[70]

arXiv preprint arXiv:2408.12601 , year=

Dreamcinema: Cinematic transfer with free camera and 3d character , author=. arXiv preprint arXiv:2408.12601 , year=

arXiv
[71]

European conference on computer vision , pages=

Occworld: Learning a 3d occupancy world model for autonomous driving , author=. European conference on computer vision , pages=. 2024 , organization=

2024
[72]

arXiv preprint arXiv:2506.10981 , year=

SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis , author=. arXiv preprint arXiv:2506.10981 , year=

Pith/arXiv arXiv
[73]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[74]

arXiv preprint arXiv:2311.15127 , year=

Stable video diffusion: Scaling latent video diffusion models to large datasets , author=. arXiv preprint arXiv:2311.15127 , year=

Pith/arXiv arXiv
[75]

International Journal of Computer Vision , volume=

ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions , author=. International Journal of Computer Vision , volume=. 2025 , doi=

2025
[76]

International Journal of Computer Vision , volume=

GuidedVDM: Controllable Video Generation with Long-Term Consistency , author=. International Journal of Computer Vision , volume=. 2026 , doi=

2026
[77]

International Journal of Computer Vision , volume=

Semantic Hierarchy Emerges in Deep Generative Representations for Scene Synthesis , author=. International Journal of Computer Vision , volume=. 2021 , doi=

2021
[78]

International Journal of Computer Vision , volume=

Conditional Temporal Variational AutoEncoder for Action Video Prediction , author=. International Journal of Computer Vision , volume=. 2023 , doi=

2023
[79]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Animate anyone: Consistent and controllable image-to-video synthesis for character animation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[80]

arXiv preprint arXiv:2506.10962 , year=

SpectralAR: Spectral Autoregressive Visual Generation , author=. arXiv preprint arXiv:2506.10962 , year=

arXiv

Showing first 80 references.

[1] Agarwal N, Ali A, Bala M, Balaji Y, Barker E, Cai T, Chattopadhyay P, Chen Y, Cui Y, Ding Y, et al. (2025) Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575

[2] Bain M, Nagrani A, Varol G, Zisserman A (2021) Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1728–1738

[3] Blattmann A, Dockhorn T, Kulal S, Mendelevitch D, Kilian M, Lorenz D, Levi Y, English Z, Voleti V, Letts A, et al. (2023a) Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127

[4] Blattmann A, Rombach R, Ling H, Dockhorn T, Kim SW, Fidler S, Kreis K (2023b) Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22563–22575

[5] Carreira J, Noland E, Banki-Horvath A, Hillier C, Zisserman A (2018) A short note about kinetics-600. arXiv preprint arXiv:1808.01340

[6] Chen H, Wang Z, Li X, Sun X, Chen F, Liu J, Wang J, Raj B, Liu Z, Barsoum E (2025a) Softvq-vae: Efficient 1-dimensional continuous tokenizer. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 28358–28370

[7] Chen L, Li Z, Lin B, Zhu B, Wang Q, Yuan S, Zhou X, Cheng X, Yuan L (2024a) Od-vae: An omni-dimensional video compressor for improving latent video diffusion model. arXiv preprint arXiv:2409.01199

[8] Chen W, Liu F, Wu D, Sun H, Lu J, Duan Y (2024b) Dreamcinema: Cinematic transfer with free camera and 3d character. arXiv preprint arXiv:2408.12601

[9] Chen W, Bi J, Huang Y, Zheng W, Duan Y (2025b) Scenecompleter: Dense 3d scene completion for generative novel view synthesis. arXiv preprint arXiv:2506.10981

[10] De Rivaz P, Haughton J (2019) Av1 bitstream & decoding process specification. The Alliance for Open Media 1:2

[11] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2020) Generative adversarial networks. Communications of the ACM 63(11):139–144

[12] HaCohen Y, Chiprut N, Brazowski B, Shalem D, Moshe D, Richardson E, Levin E, Shiran G, Zabari N, Gordon O, et al. (2024) Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103

[13] Huang K, Huang Y, Wang X, Lin Z, Ning X, Wan P, Zhang D, Wang Y, Liu X (2025a) Filmaster: Bridging cinematic principles and generative ai for automated film generation. arXiv preprint arXiv:2506.18899

[14] Huang Y, Zheng W, Gao Y, Tao X, Wan P, Zhang D, Zhou J, Lu J (2024) Owl-1: Omni world model for consistent long video generation. arXiv preprint arXiv:2412.09600

[15] Huang Y, Chen W, Zheng W, Duan Y, Zhou J, Lu J (2025b) Spectralar: Spectral autoregressive visual generation. arXiv preprint arXiv:2506.10962

[16] Jang H, Yu S, Shin J, Abbeel P, Seo Y (2025) Efficient long video tokenization via coordinate-based patch reconstruction. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22853–22863

[17] Jin Y, Sun Z, Xu K, Chen L, Jiang H, Huang Q, Song C, Liu Y, Zhang D, Song Y, et al. (2024) Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv preprint arXiv:2402.03161

[18] Johnson J, Alahi A, Fei-Fei L (2016) Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision, Springer, pp 694–711

[19] Kim K, Lee H, Park J, Kim S, Lee K, Kim S, Yoo J (2024) Hybrid video diffusion models with 2d triplane and 3d wavelet representation. In: European Conference on Computer Vision, Springer, pp 148–165

[20] Larsen ABL, Sønderby SK, Larochelle H, Winther O (2016) Autoencoding beyond pixels using a learned similarity metric. In: International conference on machine learning, PMLR, pp 1558–1566

[21] Li Y, Tian C, Xia R, Liao N, Guo W, Yan J, Li H, Dai J, Li H, Yang X (2025) Learning adaptive and temporally causal video tokenization in a 1d latent space. arXiv preprint arXiv:2505.17011

[22] Li Z, Zhang J, Lin Q, Xiong J, Long Y, Deng X, Zhang Y, Liu X, Huang M, Xiao Z, et al. (2024) Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748

[23] Liu F, Wang H, Chen W, Sun H, Duan Y (2024) Make-your-3d: Fast and consistent subject-driven 3d content generation. In: European Conference on Computer Vision, Springer, pp 389–406

[24] Liu H, Sun W, Zhang Q, Di D, Gong B, Li H, Wei C, Zou C (2025) Hi-vae: Efficient video autoencoding with global and detailed motion. arXiv preprint arXiv:2506.07136

[25] Ma N, Goldstein M, Albergo MS, Boffi NM, Vanden-Eijnden E, Xie S (2024) Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision, Springer, pp 23–40

[26] Podell D, English Z, Lacey K, Blattmann A, Dockhorn T, Müller J, Penna J, Rombach R (2023) Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952

[27] Ren X, Huang J, Zeng X, Museth K, Fidler S, Williams F (2024a) Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4209–4219

[28] Ren X, Lu Y, Liang H, Wu Z, Ling H, Chen M, Fidler S, Williams F, Huang J (2024b) Scube: Instant large-scale scene reconstruction using voxsplats. Advances in Neural Information Processing Systems 37:97670–97698

[29] Richardson IE (2004) H.264 and MPEG-4 video compression: video coding for next-generation multimedia. John Wiley & Sons

[30] Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10684–10695

[31] Shu Y, Qiu Z, Yao T, Mei T (2026) Guidedvdm: Controllable video generation with long-term consistency. International Journal of Computer Vision 134(6), doi:10.1007/s11263-026-02901-4

[32] Soomro K, Zamir AR, Shah M (2012) Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402

[33] Tan Z, Xue B, Jia J, Wang J, Ye W, Shi S, Sun M, Wu W, Chen Q, Jiang P (2024) Sweettok: Semantic-aware spatial-temporal tokenizer for compact video discretization. arXiv preprint arXiv:2412.10443

[34] Tang A, He T, Guo J, Cheng X, Song L, Bian J (2024) Vidtok: A versatile and open-source video tokenizer. arXiv preprint arXiv:2412.13061

[35] Tian K, Jiang Y, Yuan Z, Peng B, Wang L (2024a) Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems 37:84839–84865

[36] Tian R, Dai Q, Bao J, Qiu K, Yang Y, Luo C, Wu Z, Jiang YG (2024b) Reducio! generating 1k video within 16 seconds using extremely compressed motion latents. arXiv preprint arXiv:2411.13552

[37] Unterthiner T, Van Steenkiste S, Kurach K, Marinier R, Michalski M, Gelly S (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717

[38] Wang H, Suri S, Ren Y, Chen H, Shrivastava A (2024a) Larp: Tokenizing videos with a learned autoregressive generative prior. arXiv preprint arXiv:2410.21264

[39] Wang J, Jiang Y, Yuan Z, Peng B, Wu Z, Jiang YG (2024b) Omnitokenizer: A joint image-video tokenizer for visual generation. Advances in Neural Information Processing Systems 37:28281–28295

[40] Wang S, Shen L, Xiao J, Tian Z, Wang F, Hu X, Zhu Y, Feng G (2026) Breaking redundancy via 3d sparse geometry: 3d-aware neural compression for multi-view videos. International Journal of Computer Vision 134(1), doi:10.1007/s11263-025-02604-2

[41] Wang W, Yang Y (2024) Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems 37:65618–65642

[42] Wang Y, Guo J, Xie X, He T, Sun X, Bian J (2025) Vidtwin: Video vae with decoupled structure and dynamics. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22922–22932

[43] Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4):600–612

[44] Wu J, Yin S, Feng N, He X, Li D, Hao J, Long M (2024) ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems 37:68082–68119

[45] Xu X, Wang Y, Wang L, Yu B, Jia J (2023) Conditional temporal variational autoencoder for action video prediction. International Journal of Computer Vision 131(10):2699–2722, doi:10.1007/s11263-023-01832-8

[46] Yan W, Mnih V, Faust A, Zaharia M, Abbeel P, Liu H (2024) Elastictok: Adaptive tokenization for image and video. arXiv preprint arXiv:2410.08368

[47] Yang C, Shen Y, Zhou B (2021) Semantic hierarchy emerges in deep generative representations for scene synthesis. International Journal of Computer Vision 129(5):1451–1466, doi:10.1007/s11263-020-01429-5

[48] Yao J, Yang B, Wang X (2025) Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 15703–15712

[49] Yoo J, Kim S, Lee D, Kim C, Hong S (2023) Towards end-to-end generative modeling of long videos with memory-efficient bidirectional transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22888–22897

[50] Yu Q, Weber M, Deng X, Shen X, Cremers D, Chen LC (2024a) An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems 37:128940–128966

[51] Yu S, Sohn K, Kim S, Shin J (2023) Video probabilistic diffusion models in projected latent space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 18456–18466

[52] Yu S, Nie W, Huang DA, Li B, Shin J, Anandkumar A (2024b) Efficient video diffusion models via content-frame motion-latent decomposition. arXiv preprint arXiv:2403.14148

[53] Yu S, Hahn M, Kondratyuk D, Shin J, Gupta A, Lezama J, Essa I, Ross D, Huang J (2025) Malt diffusion: Memory-augmented latent transformers for any-length video generation. arXiv preprint arXiv:2502.12632

[54] Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018) The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 586–595

[55] Zhang Y, Wang X, Chen H, Qin C, Hao Y, Mei H, Zhu W (2025) Scenariodiff: Text-to-video generation with dynamic transformations of scene conditions. International Journal of Computer Vision 133(7):4909–4922, doi:10.1007/s11263-025-02413-7

[56] Zhao S, Zhang Y, Cun X, Yang S, Niu M, Li X, Hu W, Shan Y (2024) Cv-vae: A compatible video vae for latent generative video models. Advances in Neural Information Processing Systems 37:12847–12871

[57] Zheng W, Chen W, Huang Y, Zhang B, Duan Y, Lu J (2024a) Occworld: Learning a 3d occupancy world model for autonomous driving. In: European conference on computer vision, Springer, pp 55–72

[58] Zheng Z, Peng X, Yang T, Shen C, Li S, Liu H, Zhou Y, Li T, You Y (2024b) Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404

[64] Zhong T, Tian X, Jiang B, Wang X, Tao X, Wan P, Zhang Z (2025)

[79] Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
