arxiv: 2606.17590 · v1 · pith:DBPEFAZPnew · submitted 2026-06-16 · 💻 cs.CV

TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

Pith reviewed 2026-06-27 01:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords video tokenization · time-invariant tokens · time-variant tokens · Scope-Induced Factorization · Invariant Broadcasting · video compression · scalable video generation

The pith

TivTok encodes persistent video content once in reusable time-invariant tokens and frame-specific changes in variant tokens to cut total token usage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TivTok as a tokenizer that splits each video clip into time-invariant tokens capturing content shared across all frames and time-variant tokens holding only the residuals unique to each frame. Scope-Induced Factorization enforces this split by giving invariant tokens attention over the entire clip while restricting variant tokens to their own frame plus the invariants. Invariant Broadcasting then reuses the same invariant tokens for every frame and across chunks during decoding. This reuse directly lowers the token count needed for reconstruction, which the authors show improves efficiency especially on longer sequences that contain static elements.

Core claim

TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. Scope-Induced Factorization assigns different attention scopes to the two groups so TIV tokens attend to the full clip while each TV token accesses only its frame together with the TIV tokens. In the decoder, Invariant Broadcasting reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization.
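
The attention scopes described above can be sketched as a boolean mask. This is a minimal illustration, not the paper's implementation: the token layout (TIV tokens first, then one block of tokens per frame), the function name `sif_attention_mask`, and the choice to let TIV queries see every token are all assumptions made for the example.

```python
from typing import List, Optional


def sif_attention_mask(num_tiv: int, num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed (hypothetical) layout: [TIV tokens | frame 0 tokens | frame 1 tokens | ...].
    """
    # Frame id per token; None marks a time-invariant (TIV) token.
    frame_of: List[Optional[int]] = [None] * num_tiv
    for t in range(num_frames):
        frame_of.extend([t] * tokens_per_frame)

    n = len(frame_of)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if frame_of[q] is None:
                # TIV queries have full-clip scope.
                mask[q][k] = True
            else:
                # Frame-level queries see the TIV tokens and their own frame only.
                mask[q][k] = frame_of[k] is None or frame_of[k] == frame_of[q]
    return mask


if __name__ == "__main__":
    for row in sif_attention_mask(num_tiv=2, num_frames=3, tokens_per_frame=2):
        print("".join("1" if allowed else "." for allowed in row))
```

With 2 TIV tokens and 3 frames of 2 tokens each, the TIV rows are fully set, while every frame row has only the two TIV columns and its own frame block set. No frame-level token can read another frame directly, so anything shared across frames has to pass through the TIV tokens.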

What carries the argument

Scope-Induced Factorization (SIF) that enforces full-clip attention for TIV tokens versus frame-limited attention for TV tokens, together with Invariant Broadcasting (IB) that reuses TIV tokens across time.
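
One way to picture Invariant Broadcasting on the decoder side is the sketch below. It assumes that each frame is decoded from the shared TIV tokens plus that frame's TV tokens; the `Token` stand-in type, the nested-list layout, and the helper name `broadcast_decoder_inputs` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Token = str  # stand-in for a latent vector or codebook index


def broadcast_decoder_inputs(
    tiv_tokens: Sequence[Token],
    tv_tokens_per_chunk: Sequence[Sequence[Sequence[Token]]],
) -> List[Tuple[int, int, List[Token]]]:
    """Build one independent decoder input per frame.

    tv_tokens_per_chunk[c][t] holds the TV tokens of frame t in chunk c.
    The same tiv_tokens are prepended to every frame of every chunk.
    """
    inputs = []
    for c, chunk in enumerate(tv_tokens_per_chunk):
        for t, tv in enumerate(chunk):
            inputs.append((c, t, list(tiv_tokens) + list(tv)))
    return inputs


if __name__ == "__main__":
    tiv = ["background", "object"]
    tv = [[["delta_0_0"], ["delta_0_1"]], [["delta_1_0"], ["delta_1_1"]]]
    for chunk_id, frame_id, tokens in broadcast_decoder_inputs(tiv, tv):
        print(chunk_id, frame_id, tokens)
```

Each returned entry depends only on the shared TIV tokens and one frame's TV tokens, which is what would allow frames to be reconstructed in parallel under this assumption.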

If this is right

  • Achieves an rFVD of 12.65 on the 16×256×256 benchmark
  • Delivers 2.91× higher compression efficiency on 128-frame videos versus evaluated baselines
  • Uses only 1.1% of the tokens required by downsample-based tokenizers
  • Enables parallel reconstruction and tokenization of long videos through cross-chunk reuse of invariant tokens
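
To see why cross-chunk reuse changes how the token budget scales, the following back-of-the-envelope sketch compares storing the invariant tokens once against re-encoding them for every chunk. The token counts and chunk length are illustrative assumptions, not figures from the paper, and the result is not meant to reproduce the reported 2.91× or 1.1% numbers.

```python
def tokens_with_reuse(num_tiv: int, num_tv_per_frame: int, num_frames: int) -> int:
    """Invariant tokens are stored once; variant tokens are stored per frame."""
    return num_tiv + num_tv_per_frame * num_frames


def tokens_without_reuse(
    num_tiv: int, num_tv_per_frame: int, frames_per_chunk: int, num_frames: int
) -> int:
    """Invariant tokens are re-encoded for every chunk."""
    num_chunks = -(-num_frames // frames_per_chunk)  # ceiling division
    return num_tiv * num_chunks + num_tv_per_frame * num_frames


if __name__ == "__main__":
    # Hypothetical budget: 64 TIV tokens, 4 TV tokens per frame, 16-frame chunks.
    reuse = tokens_with_reuse(64, 4, 128)
    no_reuse = tokens_without_reuse(64, 4, 16, 128)
    print(reuse, no_reuse, round(no_reuse / reuse, 2))
```

Under these made-up settings the invariant cost is paid once rather than once per chunk, so the gap between the two totals widens as the video gets longer.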

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same factorization principle could be tested on other sequential data such as audio waveforms or motion capture where some features remain stable over time.
  • Generation models built on top of TivTok tokens might allocate separate capacity to modeling invariants versus variants, potentially improving coherence over very long outputs.
  • If the reuse ratio holds at larger scales, training and inference costs for video models could drop roughly in proportion to the reported token reduction.

Load-bearing premise

Persistent content such as static backgrounds and consistent object appearances can be effectively encoded in time-invariant tokens reusable across frames and chunks without substantial reconstruction loss.

What would settle it

Reconstruction quality measured on videos whose backgrounds or object appearances change substantially between frames; if error rises markedly above that of non-reuse baselines, the factorization benefit disappears.
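
A crude way to pick clips for such a test is to estimate how much of a clip is not time-invariant. The sketch below uses the per-pixel temporal mean as a stand-in for the invariant part and reports the share of signal energy left in the residual. This proxy, the function name, and the toy clips are assumptions for illustration; the paper's own evaluation uses rFVD, and a real test would compare tokenizer reconstructions against non-reuse baselines.

```python
from typing import List

Frame = List[float]  # one flattened grayscale frame


def residual_energy_fraction(video: List[Frame]) -> float:
    """Share of signal energy left after removing the per-pixel temporal mean."""
    num_frames = len(video)
    num_pixels = len(video[0])
    mean = [sum(frame[p] for frame in video) / num_frames for p in range(num_pixels)]
    total = sum(x * x for frame in video for x in frame)
    residual = sum(
        (frame[p] - mean[p]) ** 2 for frame in video for p in range(num_pixels)
    )
    return residual / total if total > 0 else 0.0


if __name__ == "__main__":
    static_clip = [[0.2, 0.8, 0.5, 0.1] for _ in range(4)]
    changing_clip = [
        [0.2, 0.8, 0.5, 0.1],
        [0.9, 0.1, 0.0, 0.7],
        [0.3, 0.3, 0.9, 0.2],
        [0.6, 0.0, 0.4, 1.0],
    ]
    print(residual_energy_fraction(static_clip))
    print(residual_energy_fraction(changing_clip))
```

Clips with a fraction near zero are the favorable case for reuse; clips with a large fraction are the ones where the premise is most likely to fail and where a comparison against non-reuse baselines would be most informative.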

Original abstract

Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard 16×256×256 benchmark and improves compression efficiency by 2.91× for 128-frame videos compared with the evaluated baselines, while using only 1.1% of the tokens required by downsample-based tokenizers in our evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes TivTok, a reuse-aware video tokenizer that factorizes each clip into Time-Invariant (TIV) tokens capturing persistent content across frames and Time-Variant (TV) tokens capturing frame-specific residuals. Scope-Induced Factorization (SIF) enforces this split by giving TIV tokens full-clip attention scope while restricting each TV token to its own frame plus the TIV tokens; Invariant Broadcasting (IB) then reuses the same TIV tokens across frames and chunks during decoding. On the standard 16×256×256 benchmark the method reports an rFVD of 12.65, a 2.91× compression-efficiency gain for 128-frame videos relative to evaluated baselines, and a token count equal to only 1.1% of that required by downsample-based tokenizers.

Significance. If the reported metrics are reproducible, the work would meaningfully advance scalable video generation by directly attacking the redundancy of repeatedly tokenizing static backgrounds and consistent object appearances. The SIF/IB construction offers a clean architectural mechanism for token reuse that could extend to longer videos without proportional growth in token budget.

major comments (1)
  1. Abstract: the central quantitative claims (rFVD = 12.65, 2.91× efficiency, 1.1% token count) are presented without any description of training procedure, baseline implementations, dataset splits, or error analysis on dynamic content; these omissions are load-bearing for the scalability assertions that rest on the SIF/IB factorization working as described.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback. We address the concern about the abstract below, noting that abstracts are necessarily concise while the full experimental details appear in the manuscript body.

Point-by-point responses
  1. Referee: Abstract: the central quantitative claims (rFVD = 12.65, 2.91× efficiency, 1.1% token count) are presented without any description of training procedure, baseline implementations, dataset splits, or error analysis on dynamic content; these omissions are load-bearing for the scalability assertions that rest on the SIF/IB factorization working as described.

    Authors: We agree that the abstract is brief by design and omits procedural details. The manuscript provides these in full: training procedure and hyperparameters are described in Section 4.1, baseline implementations and tokenization comparisons in Section 4.2, dataset splits and benchmark construction in Section 4.3, and error analysis on dynamic versus static content (including per-category rFVD breakdowns) in Section 5.3 and the supplementary material. The reported rFVD of 12.65 follows the standard 16×256×256 protocol used by prior work; the 2.91× efficiency gain and 1.1% token count are computed directly from the token budgets and reconstruction metrics on 128-frame clips. We believe these sections substantiate the scalability claims for the SIF/IB mechanism.

    Revision: no

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents TivTok as an architectural innovation using Scope-Induced Factorization (SIF) to separate time-invariant and time-variant tokens, followed by Invariant Broadcasting (IB) for reuse. Reported metrics (rFVD 12.65, 2.91× efficiency, 1.1% token count) are empirical outcomes on standard benchmarks rather than derivations or predictions that reduce to fitted inputs or self-definitions by construction. No equations, self-citation chains, or uniqueness theorems are exhibited in the supplied material that would force the central claims to be equivalent to their own inputs. The method is self-contained as a proposed tokenizer design evaluated experimentally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Because only the abstract was reviewed, visibility into parameters and assumptions is limited; the core idea rests on the domain assumption that persistent content is separable.

axioms (1)
  • Domain assumption: videos contain substantial persistent content that can be separated into time-invariant and time-variant components without major loss.
    Foundational premise for SIF and IB to work as described.
invented entities (2)
  • Time-Invariant (TIV) tokens (no independent evidence)
    Purpose: encode information shared across frames for reuse.
    New token category introduced by the method.
  • Time-Variant (TV) tokens (no independent evidence)
    Purpose: encode frame-specific residuals.
    New token category introduced by the method.

pith-pipeline@v0.9.1-grok · 5801 in / 1224 out tokens · 44893 ms · 2026-06-27T01:17:46.181804+00:00 · methodology
