TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization
Pith reviewed 2026-06-27 01:17 UTC · model grok-4.3
The pith
TivTok encodes persistent video content once in reusable time-invariant tokens and frame-specific changes in variant tokens to cut total token usage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. Scope-Induced Factorization assigns different attention scopes to the two groups so TIV tokens attend to the full clip while each TV token accesses only its frame together with the TIV tokens. In the decoder, Invariant Broadcasting reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization.
What carries the argument
Scope-Induced Factorization (SIF) that enforces full-clip attention for TIV tokens versus frame-limited attention for TV tokens, together with Invariant Broadcasting (IB) that reuses TIV tokens across time.
If this is right
- Achieves an rFVD of 12.65 on the 16×256×256 benchmark
- Delivers 2.91× higher compression efficiency on 128-frame videos versus evaluated baselines
- Uses only 1.1% of the tokens required by downsample-based tokenizers
- Enables parallel reconstruction and tokenization of long videos through cross-chunk reuse of invariant tokens
Where Pith is reading between the lines
- The same factorization principle could be tested on other sequential data such as audio waveforms or motion capture where some features remain stable over time.
- Generation models built on top of TivTok tokens might allocate separate capacity to modeling invariants versus variants, potentially improving coherence over very long outputs.
- If the reuse ratio holds at larger scales, training and inference costs for video models could drop roughly in proportion to the reported token reduction.
Load-bearing premise
Persistent content such as static backgrounds and consistent object appearances can be effectively encoded in time-invariant tokens reusable across frames and chunks without substantial reconstruction loss.
What would settle it
Reconstruction quality measured on videos whose backgrounds or object appearances change substantially between frames; if error rises markedly above that of non-reuse baselines, the factorization benefit disappears.
read the original abstract
Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose \textbf{TivTok} (\textit{Time-Invariant Tokenizer}), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard $16{\times}256{\times}256$ benchmark and improves compression efficiency by 2.91$\times$ for 128-frame videos compared with the evaluated baselines, while using only 1.1\% of the tokens required by downsample-based tokenizers in our evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TivTok, a reuse-aware video tokenizer that factorizes each clip into Time-Invariant (TIV) tokens capturing persistent content across frames and Time-Variant (TV) tokens capturing frame-specific residuals. Scope-Induced Factorization (SIF) enforces this split by giving TIV tokens full-clip attention scope while restricting each TV token to its own frame plus the TIV tokens; Invariant Broadcasting (IB) then reuses the same TIV tokens across frames and chunks during decoding. On the standard 16×256×256 benchmark the method reports an rFVD of 12.65, a 2.91× compression-efficiency gain for 128-frame videos relative to evaluated baselines, and a token count equal to only 1.1 % of that required by downsample-based tokenizers.
Significance. If the reported metrics are reproducible, the work would meaningfully advance scalable video generation by directly attacking the redundancy of repeatedly tokenizing static backgrounds and consistent object appearances. The SIF/IB construction offers a clean architectural mechanism for token reuse that could extend to longer videos without proportional growth in token budget.
major comments (1)
- Abstract: the central quantitative claims (rFVD = 12.65, 2.91× efficiency, 1.1 % token count) are presented without any description of training procedure, baseline implementations, dataset splits, or error analysis on dynamic content; these omissions are load-bearing for the scalability assertions that rest on the SIF/IB factorization working as described.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback. We address the concern about the abstract below, noting that abstracts are necessarily concise while the full experimental details appear in the manuscript body.
read point-by-point responses
-
Referee: [—] Abstract: the central quantitative claims (rFVD = 12.65, 2.91× efficiency, 1.1 % token count) are presented without any description of training procedure, baseline implementations, dataset splits, or error analysis on dynamic content; these omissions are load-bearing for the scalability assertions that rest on the SIF/IB factorization working as described.
Authors: We agree that the abstract is brief by design and omits procedural details. The manuscript provides these in full: training procedure and hyperparameters are described in Section 4.1, baseline implementations and tokenization comparisons in Section 4.2, dataset splits and benchmark construction in Section 4.3, and error analysis on dynamic versus static content (including per-category rFVD breakdowns) in Section 5.3 and the supplementary material. The reported rFVD of 12.65 follows the standard 16×256×256 protocol used by prior work; the 2.91× efficiency gain and 1.1 % token count are computed directly from the token budgets and reconstruction metrics on 128-frame clips. We believe these sections substantiate the scalability claims for the SIF/IB mechanism. revision: no
Circularity Check
No significant circularity detected
full rationale
The paper presents TivTok as an architectural innovation using Scope-Induced Factorization (SIF) to separate time-invariant and time-variant tokens, followed by Invariant Broadcasting (IB) for reuse. Reported metrics (rFVD 12.65, 2.91× efficiency, 1.1% token count) are empirical outcomes on standard benchmarks rather than derivations or predictions that reduce to fitted inputs or self-definitions by construction. No equations, self-citation chains, or uniqueness theorems are exhibited in the supplied material that would force the central claims to be equivalent to their own inputs. The method is self-contained as a proposed tokenizer design evaluated experimentally.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Videos contain substantial persistent content that can be separated into time-invariant and time-variant components without major loss
invented entities (2)
-
Time-Invariant (TIV) tokens
no independent evidence
-
Time-Variant (TV) tokens
no independent evidence
Reference graph
Works this paper leans on
-
[1]
(2025) Cosmos world foundation model platform for physical ai
Agarwal N, Ali A, Bala M, Balaji Y, Barker E, Cai T, Chattopadhyay P, Chen Y, Cui Y, Ding Y, et al. (2025) Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:250103575
2025
-
[2]
In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1728--1738
Bain M, Nagrani A, Varol G, Zisserman A (2021) Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1728--1738
2021
-
[3]
(2023 a ) Stable video diffusion: Scaling latent video diffusion models to large datasets
Blattmann A, Dockhorn T, Kulal S, Mendelevitch D, Kilian M, Lorenz D, Levi Y, English Z, Voleti V, Letts A, et al. (2023 a ) Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:231115127
2023
-
[4]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22563--22575
Blattmann A, Rombach R, Ling H, Dockhorn T, Kim SW, Fidler S, Kreis K (2023 b ) Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22563--22575
2023
-
[5]
arXiv preprint arXiv:180801340
Carreira J, Noland E, Banki-Horvath A, Hillier C, Zisserman A (2018) A short note about kinetics-600. arXiv preprint arXiv:180801340
2018
-
[6]
In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 28358--28370
Chen H, Wang Z, Li X, Sun X, Chen F, Liu J, Wang J, Raj B, Liu Z, Barsoum E (2025 a ) Softvq-vae: Efficient 1-dimensional continuous tokenizer. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 28358--28370
2025
-
[7]
arXiv preprint arXiv:240901199
Chen L, Li Z, Lin B, Zhu B, Wang Q, Yuan S, Zhou X, Cheng X, Yuan L (2024 a ) Od-vae: An omni-dimensional video compressor for improving latent video diffusion model. arXiv preprint arXiv:240901199
2024
-
[8]
arXiv preprint arXiv:240812601
Chen W, Liu F, Wu D, Sun H, Lu J, Duan Y (2024 b ) Dreamcinema: Cinematic transfer with free camera and 3d character. arXiv preprint arXiv:240812601
2024
-
[9]
arXiv preprint arXiv:250610981
Chen W, Bi J, Huang Y, Zheng W, Duan Y (2025 b ) Scenecompleter: Dense 3d scene completion for generative novel view synthesis. arXiv preprint arXiv:250610981
2025
-
[10]
The Alliance for Open Media 1:2
De Rivaz P, Haughton J (2019) Av1 bitstream & decoding process specification. The Alliance for Open Media 1:2
2019
-
[11]
Communications of the ACM 63(11):139--144
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2020) Generative adversarial networks. Communications of the ACM 63(11):139--144
2020
-
[12]
(2024) Ltx-video: Realtime video latent diffusion
HaCohen Y, Chiprut N, Brazowski B, Shalem D, Moshe D, Richardson E, Levin E, Shiran G, Zabari N, Gordon O, et al. (2024) Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:250100103
2024
-
[13]
arXiv preprint arXiv:250618899
Huang K, Huang Y, Wang X, Lin Z, Ning X, Wan P, Zhang D, Wang Y, Liu X (2025 a ) Filmaster: Bridging cinematic principles and generative ai for automated film generation. arXiv preprint arXiv:250618899
2025
-
[14]
arXiv preprint arXiv:241209600
Huang Y, Zheng W, Gao Y, Tao X, Wan P, Zhang D, Zhou J, Lu J (2024) Owl-1: Omni world model for consistent long video generation. arXiv preprint arXiv:241209600
2024
-
[15]
arXiv preprint arXiv:250610962
Huang Y, Chen W, Zheng W, Duan Y, Zhou J, Lu J (2025 b ) Spectralar: Spectral autoregressive visual generation. arXiv preprint arXiv:250610962
2025
-
[16]
In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22853--22863
Jang H, Yu S, Shin J, Abbeel P, Seo Y (2025) Efficient long video tokenization via coordinate-based patch reconstruction. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22853--22863
2025
-
[17]
(2024) Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization
Jin Y, Sun Z, Xu K, Chen L, Jiang H, Huang Q, Song C, Liu Y, Zhang D, Song Y, et al. (2024) Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv preprint arXiv:240203161
2024
-
[18]
In: European conference on computer vision, Springer, pp 694--711
Johnson J, Alahi A, Fei-Fei L (2016) Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision, Springer, pp 694--711
2016
-
[19]
In: European Conference on Computer Vision, Springer, pp 148--165
Kim K, Lee H, Park J, Kim S, Lee K, Kim S, Yoo J (2024) Hybrid video diffusion models with 2d triplane and 3d wavelet representation. In: European Conference on Computer Vision, Springer, pp 148--165
2024
-
[20]
In: International conference on machine learning, PMLR, pp 1558--1566
Larsen ABL, S nderby SK, Larochelle H, Winther O (2016) Autoencoding beyond pixels using a learned similarity metric. In: International conference on machine learning, PMLR, pp 1558--1566
2016
-
[21]
arXiv preprint arXiv:250517011
Li Y, Tian C, Xia R, Liao N, Guo W, Yan J, Li H, Dai J, Li H, Yang X (2025) Learning adaptive and temporally causal video tokenization in a 1d latent space. arXiv preprint arXiv:250517011
2025
-
[22]
(2024) Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding
Li Z, Zhang J, Lin Q, Xiong J, Long Y, Deng X, Zhang Y, Liu X, Huang M, Xiao Z, et al. (2024) Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:240508748
2024
-
[23]
In: European Conference on Computer Vision, Springer, pp 389--406
Liu F, Wang H, Chen W, Sun H, Duan Y (2024) Make-your-3d: Fast and consistent subject-driven 3d content generation. In: European Conference on Computer Vision, Springer, pp 389--406
2024
-
[24]
arXiv preprint arXiv:250607136
Liu H, Sun W, Zhang Q, Di D, Gong B, Li H, Wei C, Zou C (2025) Hi-vae: Efficient video autoencoding with global and detailed motion. arXiv preprint arXiv:250607136
2025
-
[25]
In: European Conference on Computer Vision, Springer, pp 23--40
Ma N, Goldstein M, Albergo MS, Boffi NM, Vanden-Eijnden E, Xie S (2024) Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision, Springer, pp 23--40
2024
-
[26]
arXiv preprint arXiv:230701952
Podell D, English Z, Lacey K, Blattmann A, Dockhorn T, M \"u ller J, Penna J, Rombach R (2023) Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:230701952
2023
-
[27]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4209--4219
Ren X, Huang J, Zeng X, Museth K, Fidler S, Williams F (2024 a ) Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4209--4219
2024
-
[28]
Advances in Neural Information Processing Systems 37:97670--97698
Ren X, Lu Y, Liang H, Wu Z, Ling H, Chen M, Fidler S, Williams F, Huang J (2024 b ) Scube: Instant large-scale scene reconstruction using voxsplats. Advances in Neural Information Processing Systems 37:97670--97698
2024
-
[29]
264 and MPEG-4 video compression: video coding for next-generation multimedia
Richardson IE (2004) H. 264 and MPEG-4 video compression: video coding for next-generation multimedia. John Wiley & Sons
2004
-
[30]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10684--10695
Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10684--10695
2022
-
[31]
International Journal of Computer Vision 134(6), doi:10.1007/s11263-026-02901-4
Shu Y, Qiu Z, Yao T, Mei T (2026) Guidedvdm: Controllable video generation with long-term consistency. International Journal of Computer Vision 134(6), doi:10.1007/s11263-026-02901-4
-
[32]
arXiv preprint arXiv:12120402
Soomro K, Zamir AR, Shah M (2012) Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:12120402
2012
-
[33]
arXiv preprint arXiv:241210443
Tan Z, Xue B, Jia J, Wang J, Ye W, Shi S, Sun M, Wu W, Chen Q, Jiang P (2024) Sweettok: Semantic-aware spatial-temporal tokenizer for compact video discretization. arXiv preprint arXiv:241210443
2024
-
[34]
arXiv preprint arXiv:241213061
Tang A, He T, Guo J, Cheng X, Song L, Bian J (2024) Vidtok: A versatile and open-source video tokenizer. arXiv preprint arXiv:241213061
2024
-
[35]
Advances in neural information processing systems 37:84839--84865
Tian K, Jiang Y, Yuan Z, Peng B, Wang L (2024 a ) Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems 37:84839--84865
2024
-
[36]
arXiv preprint arXiv:241113552
Tian R, Dai Q, Bao J, Qiu K, Yang Y, Luo C, Wu Z, Jiang YG (2024 b ) Reducio! generating 1k video within 16 seconds using extremely compressed motion latents. arXiv preprint arXiv:241113552
2024
-
[37]
arXiv preprint arXiv:181201717
Unterthiner T, Van Steenkiste S, Kurach K, Marinier R, Michalski M, Gelly S (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717
2018
-
[38]
arXiv preprint arXiv:241021264
Wang H, Suri S, Ren Y, Chen H, Shrivastava A (2024 a ) Larp: Tokenizing videos with a learned autoregressive generative prior. arXiv preprint arXiv:241021264
2024
-
[39]
Advances in Neural Information Processing Systems 37:28281--28295
Wang J, Jiang Y, Yuan Z, Peng B, Wu Z, Jiang YG (2024 b ) Omnitokenizer: A joint image-video tokenizer for visual generation. Advances in Neural Information Processing Systems 37:28281--28295
2024
-
[40]
International Journal of Computer Vision 134(1), doi:10.1007/s11263-025-02604-2
Wang S, Shen L, Xiao J, Tian Z, Wang F, Hu X, Zhu Y, Feng G (2026) Breaking redundancy via 3d sparse geometry: 3d-aware neural compression for multi-view videos. International Journal of Computer Vision 134(1), doi:10.1007/s11263-025-02604-2
-
[41]
Advances in Neural Information Processing Systems 37:65618--65642
Wang W, Yang Y (2024) Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems 37:65618--65642
2024
-
[42]
In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22922--22932
Wang Y, Guo J, Xie X, He T, Sun X, Bian J (2025) Vidtwin: Video vae with decoupled structure and dynamics. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 22922--22932
2025
-
[43]
IEEE transactions on image processing 13(4):600--612
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4):600--612
2004
-
[44]
Advances in Neural Information Processing Systems 37:68082--68119
Wu J, Yin S, Feng N, He X, Li D, Hao J, Long M (2024) ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems 37:68082--68119
2024
-
[45]
International Journal of Computer Vision 131(10):2699--2722, doi:10.1007/s11263-023-01832-8
Xu X, Wang Y, Wang L, Yu B, Jia J (2023) Conditional temporal variational autoencoder for action video prediction. International Journal of Computer Vision 131(10):2699--2722, doi:10.1007/s11263-023-01832-8
-
[46]
arXiv preprint arXiv:241008368
Yan W, Mnih V, Faust A, Zaharia M, Abbeel P, Liu H (2024) Elastictok: Adaptive tokenization for image and video. arXiv preprint arXiv:241008368
2024
-
[47]
International Journal of Computer Vision 129(5):1451--1466, doi:10.1007/s11263-020-01429-5
Yang C, Shen Y, Zhou B (2021) Semantic hierarchy emerges in deep generative representations for scene synthesis. International Journal of Computer Vision 129(5):1451--1466, doi:10.1007/s11263-020-01429-5
-
[48]
generation: Taming optimization dilemma in latent diffusion models
Yao J, Yang B, Wang X (2025) Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 15703--15712
2025
-
[49]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22888--22897
Yoo J, Kim S, Lee D, Kim C, Hong S (2023) Towards end-to-end generative modeling of long videos with memory-efficient bidirectional transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22888--22897
2023
-
[50]
Advances in Neural Information Processing Systems 37:128940--128966
Yu Q, Weber M, Deng X, Shen X, Cremers D, Chen LC (2024 a ) An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems 37:128940--128966
2024
-
[51]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 18456--18466
Yu S, Sohn K, Kim S, Shin J (2023) Video probabilistic diffusion models in projected latent space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 18456--18466
2023
-
[52]
arXiv preprint arXiv:240314148
Yu S, Nie W, Huang DA, Li B, Shin J, Anandkumar A (2024 b ) Efficient video diffusion models via content-frame motion-latent decomposition. arXiv preprint arXiv:240314148
2024
-
[53]
arXiv preprint arXiv:250212632
Yu S, Hahn M, Kondratyuk D, Shin J, Gupta A, Lezama J, Essa I, Ross D, Huang J (2025) Malt diffusion: Memory-augmented latent transformers for any-length video generation. arXiv preprint arXiv:250212632
2025
-
[54]
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 586--595
Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018) The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 586--595
2018
-
[55]
International Journal of Computer Vision 133(7):4909--4922, doi:10.1007/s11263-025-02413-7
Zhang Y, Wang X, Chen H, Qin C, Hao Y, Mei H, Zhu W (2025) Scenariodiff: Text-to-video generation with dynamic transformations of scene conditions. International Journal of Computer Vision 133(7):4909--4922, doi:10.1007/s11263-025-02413-7
-
[56]
Advances in Neural Information Processing Systems 37:12847--12871
Zhao S, Zhang Y, Cun X, Yang S, Niu M, Li X, Hu W, Shan Y (2024) Cv-vae: A compatible video vae for latent generative video models. Advances in Neural Information Processing Systems 37:12847--12871
2024
-
[57]
In: European conference on computer vision, Springer, pp 55--72
Zheng W, Chen W, Huang Y, Zhang B, Duan Y, Lu J (2024 a ) Occworld: Learning a 3d occupancy world model for autonomous driving. In: European conference on computer vision, Springer, pp 55--72
2024
-
[58]
arXiv preprint arXiv:241220404
Zheng Z, Peng X, Yang T, Shen C, Li S, Liu H, Zhou Y, Li T, You Y (2024 b ) Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:241220404
2024
-
[59]
FirstName Alpher , title =
-
[60]
Journal of Foo , volume = 13, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
-
[61]
Journal of Foo , volume = 14, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
-
[62]
FirstName Alpher and FirstName Gamow , title =
-
[63]
Computer Vision -- ECCV 2022 , year =
2022
-
[64]
2025 , url=
Zhong, Tianxiong and Tian, Xingye and Jiang, Boyuan and Wang, Xuebo and Tao, Xin and Wan, Pengfei and Zhang, Zhiwei , booktitle=. 2025 , url=
2025
-
[65]
Scaling Learning Algorithms Towards
Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
-
[66]
and Osindero, Simon and Teh, Yee Whye , journal =
Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
-
[67]
2016 , publisher=
Deep learning , author=. 2016 , publisher=
2016
-
[68]
European Conference on Computer Vision , pages=
Make-your-3d: Fast and consistent subject-driven 3d content generation , author=. European Conference on Computer Vision , pages=. 2024 , organization=
2024
-
[69]
arXiv preprint arXiv:2412.09600 , year=
Owl-1: Omni world model for consistent long video generation , author=. arXiv preprint arXiv:2412.09600 , year=
-
[70]
arXiv preprint arXiv:2408.12601 , year=
Dreamcinema: Cinematic transfer with free camera and 3d character , author=. arXiv preprint arXiv:2408.12601 , year=
-
[71]
European conference on computer vision , pages=
Occworld: Learning a 3d occupancy world model for autonomous driving , author=. European conference on computer vision , pages=. 2024 , organization=
2024
-
[72]
arXiv preprint arXiv:2506.10981 , year=
SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis , author=. arXiv preprint arXiv:2506.10981 , year=
-
[73]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[74]
arXiv preprint arXiv:2311.15127 , year=
Stable video diffusion: Scaling latent video diffusion models to large datasets , author=. arXiv preprint arXiv:2311.15127 , year=
-
[75]
International Journal of Computer Vision , volume=
ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions , author=. International Journal of Computer Vision , volume=. 2025 , doi=
2025
-
[76]
International Journal of Computer Vision , volume=
GuidedVDM: Controllable Video Generation with Long-Term Consistency , author=. International Journal of Computer Vision , volume=. 2026 , doi=
2026
-
[77]
International Journal of Computer Vision , volume=
Semantic Hierarchy Emerges in Deep Generative Representations for Scene Synthesis , author=. International Journal of Computer Vision , volume=. 2021 , doi=
2021
-
[78]
International Journal of Computer Vision , volume=
Conditional Temporal Variational AutoEncoder for Action Video Prediction , author=. International Journal of Computer Vision , volume=. 2023 , doi=
2023
-
[79]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Animate anyone: Consistent and controllable image-to-video synthesis for character animation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[80]
arXiv preprint arXiv:2506.10962 , year=
SpectralAR: Spectral Autoregressive Visual Generation , author=. arXiv preprint arXiv:2506.10962 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.