Stream-T1: Test-Time Scaling for Streaming Video Generation
Recognition: 2 theorem links
Pith reviewed 2026-05-08 18:11 UTC · model grok-4.3
The pith
Stream-T1 shows that test-time scaling works efficiently for video generation when applied chunk-by-chunk in a streaming setup rather than to full sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stream-T1 is a test-time scaling framework for streaming video generation built around three units: Stream-Scaled Noise Propagation refines the current chunk's initial noise using high-quality noise from prior chunks to establish temporal dependency; Stream-Scaled Reward Pruning scores candidates by balancing immediate spatial quality with sliding-window temporal coherence; and Stream-Scaled Memory Sinking routes evicted KV-cache context into update pathways guided by reward signals so past visuals anchor future frames. On 5-second and 30-second benchmarks this yields better temporal consistency, motion smoothness, and frame quality than prior methods.
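The abstract gives no pseudocode, so what follows is only a shape-of-the-loop sketch of how the three units compose, with stand-in components: `denoise_chunk`, the reward functions, the candidate count, the four-chunk window, and β = 0.5 are all illustrative assumptions rather than the paper's implementation, and the memory-sinking update is reduced to a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_chunk(noise, history):
    """Stand-in for few-step chunk denoising conditioned on history."""
    return np.tanh(noise + (history[-1] if history else 0.0) * 0.1)

def spatial_reward(chunk):
    """Stand-in for an immediate frame-quality score."""
    return float(-np.mean(chunk ** 2))

def temporal_reward(chunk, window):
    """Stand-in for sliding-window temporal coherence vs. recent chunks."""
    return float(-np.mean([(chunk - w).var() for w in window])) if window else 0.0

def stream_tts_sketch(num_chunks=6, num_candidates=4, shape=(8, 16, 16)):
    video, best_noise = [], None
    for _ in range(num_chunks):
        candidates = []
        for _ in range(num_candidates):
            eps = rng.standard_normal(shape)
            if best_noise is None:
                noise = eps                       # first chunk: fresh noise
            else:
                beta = 0.5                        # Noise Propagation: reuse
                noise = beta * best_noise + np.sqrt(1 - beta ** 2) * eps
            chunk = denoise_chunk(noise, video)
            # Reward Pruning: immediate spatial score + short sliding window
            r = spatial_reward(chunk) + temporal_reward(chunk, video[-4:])
            candidates.append((r, chunk, noise))
        _, chunk, best_noise = max(candidates, key=lambda c: c[0])
        video.append(chunk)  # Memory Sinking would update a KV cache here
    return video

print(len(stream_tts_sketch()))  # 6 chunks, each selected from 4 candidates
```

The point of the sketch is the control flow: candidates stay cheap because each chunk needs only a few denoising steps, and selection feedback (the winning noise and its reward) flows forward into the next chunk.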
What carries the argument
Three Stream-Scaled units (noise propagation from prior chunks, reward pruning across short- and long-term windows, and reward-guided memory sinking) that together turn chunk-level few-step synthesis into an efficient test-time scaling regime.
If this is right
- Temporal dependency can be injected at test time by reusing proven prior-chunk noise instead of sampling fresh noise for every segment.
- Candidate selection can trade off local frame aesthetics against global video coherence by combining immediate and sliding-window rewards.
- KV-cache memory can be dynamically updated based on reward feedback so evicted context continues to guide later chunks without uniform overwriting (a toy sketch follows this list).
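To make the last point concrete, one can picture memory sinking as a reward-weighted cache policy: evicted context is folded back into a bounded store in proportion to its reward, and the lowest-weight entry, not simply the oldest, is dropped at capacity. A toy sketch; `sink_memory`, the `(weight, entry)` deque, and the capacity of 8 are invented for illustration, while the paper's mechanism operates on transformer KV tensors and is not specified here:

```python
from collections import deque
import numpy as np

def sink_memory(cache, evicted, reward, capacity=8):
    """Hypothetical reward-guided sinking for a bounded memory.

    cache:   deque of (weight, entry) pairs
    evicted: entry that just fell out of the attention window
    reward:  scalar quality signal for the chunk that produced it
    """
    if reward > 0.0:
        cache.append((reward, evicted))  # high-reward context re-enters
    while len(cache) > capacity:
        # Evict the lowest-weight entry, not uniformly the oldest.
        del cache[min(range(len(cache)), key=lambda i: cache[i][0])]
    return cache

cache = deque()
for _ in range(12):
    block = np.random.randn(4)   # stand-in for an evicted KV block
    cache = sink_memory(cache, block, reward=float(np.random.rand()))
print(len(cache))                # bounded at `capacity`
```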
Where Pith is reading between the lines
- The same chunk-wise scaling pattern could be tested on other autoregressive generation tasks such as long audio or conditional image sequences.
- If the overhead of the three units stays low, the method might allow longer generated videos without proportional increases in training data or model size.
- A direct measurement of wall-clock time versus quality gain on videos longer than 30 seconds would clarify whether the streaming advantage scales.
Load-bearing premise
That chunk-level synthesis with few denoising steps is naturally suited to test-time scaling and that the three units can be combined without creating new instabilities or excessive overhead.
What would settle it
A controlled experiment that applies the same three units to non-streaming full-video diffusion. If the non-streaming variant matched the streaming one in quality at comparable total cost, the claimed advantage of the streaming shift would be falsified; if it produced equal or lower quality at higher total cost than standard test-time methods, the advantage would stand.
Original abstract
While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited for TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduced Stream-T1, a pioneering comprehensive TTS framework exclusively tailored for streaming video generation. Specifically, Stream-T1 is composed of three units: (1) Stream -Scaled Noise Propagation, which actively refines the initial latent noise of the generating chunk using historically proven, high-quality previous chunk noise, effectively establishes temporal dependency and utilizing the historical Gaussian prior to guide the current generation; (2) Stream -Scaled Reward Pruning, which comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations; (3) Stream-Scaled Memory Sinking, which dynamically routes the context evicted from KV-cache into distinct updating pathways guided by the reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream. Evaluated on both 5s and 30s comprehensive video benchmarks, Stream-T1 demonstrates profound superiority, significantly improving temporal consistency, motion smoothness, and frame-level visual quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce Stream-T1, a comprehensive test-time scaling framework for streaming video generation. By focusing on chunk-level synthesis and few denoising steps, it proposes three units: Stream-Scaled Noise Propagation to refine initial latent noise using historical high-quality noise, Stream-Scaled Reward Pruning to balance local spatial aesthetics and global temporal coherence using short-term and sliding-window long-term evaluations, and Stream-Scaled Memory Sinking to route evicted KV-cache context based on reward feedback. The framework is evaluated on 5s and 30s video benchmarks, claiming profound superiority in temporal consistency, motion smoothness, and frame-level visual quality.
Significance. Should the quantitative results support the claims, this could represent a meaningful contribution to efficient test-time scaling in video diffusion models by exploiting the streaming paradigm to achieve better temporal control with lower overhead. The engineering of the three units to address specific bottlenecks in candidate exploration and temporal guidance is a targeted approach.
Major comments (2)
- [Abstract] The abstract states that Stream-T1 'demonstrates profound superiority' on benchmarks but provides no quantitative numbers, error bars, baseline comparisons, or details on how candidates are generated and scored. This makes the central performance claim unverifiable from the given information and is load-bearing for the paper's main assertion.
- [Proposed Units] The integration of the three units is asserted to enable stable low-overhead TTS without new instabilities, but no analysis, ablations, or discussion of potential error accumulation (e.g., in noise propagation over 30s sequences or reward feedback loops) is evident. This is load-bearing for the claim that chunk-level synthesis is intrinsically suited for TTS. An illustrative drift measurement follows this list.
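One way to run the error-accumulation analysis this comment asks for is to track a per-chunk drift statistic and check whether it grows with chunk index. A minimal sketch, assuming some frozen feature extractor is available; `drift_curve` and the stand-in `embed` are hypothetical, not from the paper:

```python
import numpy as np

def drift_curve(chunks, embed):
    """Per-chunk drift: distance between consecutive chunk embeddings.

    A curve that rises with chunk index suggests accumulating error; a
    flat curve supports the stability claim. `embed` can be any frozen
    feature extractor (e.g., a video encoder's pooled output).
    """
    embs = [embed(c) for c in chunks]
    return [float(np.linalg.norm(embs[i + 1] - embs[i]))
            for i in range(len(embs) - 1)]

# Toy usage with random "chunks" and a mean-pooling stand-in embedding.
chunks = [np.random.randn(8, 16, 16) for _ in range(10)]
print([round(v, 3) for v in drift_curve(chunks, embed=lambda c: c.mean(axis=(1, 2)))])
```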
Minor comments (1)
- [Abstract] Inconsistent hyphenation in 'Stream -Scaled' (space before hyphen in some places); should be standardized to 'Stream-Scaled' throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions made to strengthen the paper.
Point-by-point responses
Referee: [Abstract] The abstract states that Stream-T1 'demonstrates profound superiority' on benchmarks but provides no quantitative numbers, error bars, baseline comparisons, or details on how candidates are generated and scored. This makes the central performance claim unverifiable from the given information and is load-bearing for the paper's main assertion.
Authors: We agree that the abstract would benefit from greater specificity to make the performance claims more immediately verifiable. In the revised manuscript, we have updated the abstract to include concise quantitative highlights drawn from our experimental results (e.g., relative gains in temporal consistency and visual quality metrics versus baselines), while preserving brevity. Full details on candidate generation, scoring, error bars, and baseline comparisons are already present in Sections 3 and 4; the abstract revision simply surfaces the most salient numbers for readers. revision: yes
Referee: [Proposed Units] The integration of the three units is asserted to enable stable low-overhead TTS without new instabilities, but no analysis, ablations, or discussion of potential error accumulation (e.g., in noise propagation over 30s sequences or reward feedback loops) is evident. This is load-bearing for the claim that chunk-level synthesis is intrinsically suited for TTS.
Authors: We acknowledge that an explicit analysis of stability and error accumulation would strengthen the argument that chunk-level synthesis is particularly well-suited for test-time scaling. In the revised manuscript we have added a new subsection (with accompanying ablations and figures) that examines noise propagation drift and reward-feedback loop behavior across 5 s and 30 s sequences. The added experiments show that the memory-sinking mechanism prevents measurable accumulation of errors, thereby supporting the original claim without introducing new instabilities. revision: yes
Circularity Check
No significant circularity; empirical engineering framework with no derivation chain
Full rationale
The paper introduces Stream-T1 as a test-time scaling framework for streaming video generation, consisting of three descriptive units (noise propagation, reward pruning, memory sinking) motivated by the suitability of chunk-level synthesis. No equations, first-principles derivations, fitted parameters, or predictions are presented that could reduce to inputs by construction. Central claims rest on empirical benchmark evaluations (5s/30s videos) rather than any self-referential logic, self-citation load-bearing, or ansatz smuggling. The contribution is therefore self-contained as an engineering proposal without circular reductions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Foundation/BranchSelection.lean, theorem branch_selection (no contact: VP-SDE interpolation, not RCL bilinear coupling). Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Matched passage: "x_T^n = β x_T^{n−1} + √(1−β²) ε, ε ~ N(0,I) ... this interpolation guarantees that the marginal distribution of the noise remains strictly invariant, consistently adhering to the standard isotropic Gaussian N(0,I)."
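The invariance asserted in this passage is standard Gaussian algebra; a one-line check, assuming x_T^{n−1} and ε are independent N(0, I) draws:

```latex
% Linear combinations of independent Gaussians are Gaussian, so it
% suffices to check the first two moments of x_T^n:
\[
  \mathbb{E}\!\left[x_T^{n}\right]
    = \beta \cdot 0 + \sqrt{1-\beta^{2}} \cdot 0 = 0,
  \qquad
  \operatorname{Cov}\!\left[x_T^{n}\right]
    = \beta^{2} I + \bigl(1-\beta^{2}\bigr) I = I,
\]
% hence x_T^n ~ N(0, I): the marginal noise distribution is invariant.
```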
- IndisputableMonolith (whole framework), theorem reality_from_one_distinction. Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Matched passage: "Stream-T1 is composed of three units: Stream-Scaled Noise Propagation, Stream-Scaled Reward Pruning, Stream-Scaled Memory Sinking."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, et al. Reasoning on a budget: A survey of adaptive and controllable test-time compute in LLMs. arXiv preprint arXiv:2507.02076, 2025.
- [2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [3] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025.
- [4] Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169, 2024.
- [5] Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, and Long Chen. Ca2-VDM: Efficient autoregressive video diffusion model with causal generation and cache sharing. arXiv preprint arXiv:2411.16375, 2024.
- [6] Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325, 2025.
- [7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025.
- [8] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
- [9] Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Ling Pan. Scaling image and video generation via test-time evolutionary search. arXiv preprint arXiv:2505.17618, 2025.
- [10] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
- [11] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
- [12] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [13] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [14] Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. VideoAR: Autoregressive video generation via next-frame & scale prediction. arXiv preprint arXiv:2601.05966, 2026.
- [15] Minsuk Ji, Sanghyeok Lee, and Namhyuk Ahn. Compositional image synthesis with inference-time scaling. In ICASSP 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4441–4445. IEEE, 2026.
- [16] Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. VideoPoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.
- [17] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [18] Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion. arXiv preprint arXiv:2602.07775, 2026.
- [19] Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Reflect-DiT: Inference-time scaling for text-to-image diffusion transformers via in-context reflection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15657–15668, 2025.
- [20] Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. ARLON: Boosting diffusion transformers with autoregressive models for long video generation. arXiv preprint arXiv:2410.20502, 2024.
- [21] Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, and Yueqi Duan. Video-T1: Test-time scaling for video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18671–18681, 2025.
- [22] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.
- [23] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling Forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025.
- [24] Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1B LLM surpass 405B LLM? Rethinking compute-optimal test-time scaling. arXiv preprint arXiv:2502.06703, 2025.
- [25] Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward Forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678, 2025.
- [26] Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025.
- [27] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025.
- [28] Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta. Inference-time text-to-video alignment with diffusion latent beam search. arXiv preprint arXiv:2501.19252, 2025.
- [29] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
- [30] Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, and Nong Sang. Hierarchical spatio-temporal decoupling for text-to-video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6635–6645, 2024.
- [31] Vignav Ramesh and Morteza Mardani. Test-time scaling of diffusion models via noise trajectory search. arXiv preprint arXiv:2506.03164, 2025.
- [32] Shuhuai Ren, Shuming Ma, Xu Sun, and Furu Wei. Next block prediction: Video generation via semi-autoregressive modeling. arXiv preprint arXiv:2502.07737, 2025.
- [33] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
- [34] Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. arXiv preprint arXiv:2501.06848, 2025.
- [35] Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025.
- [36] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.
- [37] Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [38] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
- [39] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [40] Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, et al. ART-V: Auto-regressive text-to-video generation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024.
- [41] Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, and Kaiqi Huang. ImagerySearch: Adaptive test-time search for video generation beyond semantic dependency constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 10700–10708, 2026.
- [42] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.
- [43] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. VisionReward: Fine-grained multi-dimensional human preference learning for image and video generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11269–11277, 2026.
- [44] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021.
- [45] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. LongLive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025.
- [46] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [47] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025.
- [48] Hu Yu, Biao Gong, Hangjie Yuan, DanDan Zheng, Weilong Chai, Jingdong Chen, Kecheng Zheng, and Feng Zhao. VideoMAR: Autoregressive video generation with continuous tokens. arXiv preprint arXiv:2506.14168, 2025.
- [49] Hangjie Yuan, Weihua Chen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, et al. Lumos-1: On autoregressive video generation with discrete diffusion from a unified model perspective. In The Fourteenth International Conference on Learning Representations.
- [50] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.
- [51] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. International Journal of Computer Vision, 133(4):1879–1893, 2025.
- [52] Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma. What, how, where, and how well? A survey on test-time scaling in large language models. arXiv preprint arXiv:2503.24235, 2025.
- [53] Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2024.
- [54] Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Zhensong Zhang, Jifei Song, Jiankang Deng, and Ioannis Patras. LatSearch: Latent reward-guided search for faster inference-time scaling in video diffusion, 2026. URL https://arxiv.org/abs/2603.14526.
- [55] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. MagicVideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
- [56] Zikai Zhou, Shitong Shao, Lichen Bai, Shufei Zhang, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17688–17697, 2025.
- [57] Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, and Hongsheng Li. From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15329–15339, 2025.