MAGI-1: Autoregressive Video Generation at Scale
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 20:24 UTC · model grok-4.3
The pith
MAGI-1 generates videos by autoregressively predicting fixed-length chunks of frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAGI-1 is trained to denoise per-chunk noise that increases monotonically over time, enabling it to generate videos autoregressively as sequences of fixed-length frame segments. This produces causal temporal modeling, supports streaming generation, and maintains constant peak inference cost independent of video length. The approach achieves strong performance on image-to-video tasks with text instructions and scales to 24 billion parameters with up to 4 million token contexts.
What carries the argument
Chunk-wise autoregressive prediction where each video chunk is denoised with noise levels increasing monotonically over successive chunks.
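To make the load-bearing mechanism concrete, here is a minimal sketch of how monotonically increasing per-chunk noise could be assigned during training. The linear ramp, interpolation-style corruption, and frames-first latent layout are illustrative assumptions, not the paper's actual training procedure.

```python
# Illustrative sketch (assumed, not MAGI-1's code): assign each fixed-length chunk a
# noise level that increases monotonically with its position in the video, so earlier
# chunks are cleaner and later chunks are noisier within one training window.
import torch

def per_chunk_noise_levels(num_chunks: int, t_min: float = 0.0, t_max: float = 1.0) -> torch.Tensor:
    """Monotonically increasing noise level per chunk (hypothetical linear ramp)."""
    return torch.linspace(t_min, t_max, num_chunks)

def add_chunkwise_noise(latents: torch.Tensor, chunk_len: int = 24):
    """latents: (frames, channels, H, W); frames is assumed divisible by chunk_len."""
    num_chunks = latents.shape[0] // chunk_len
    t = per_chunk_noise_levels(num_chunks)                 # (num_chunks,)
    t = t.repeat_interleave(chunk_len).view(-1, 1, 1, 1)   # one level per frame
    noise = torch.randn_like(latents)
    # Interpolation-style corruption; the paper's exact forward process may differ.
    noisy = (1.0 - t) * latents + t * noise
    return noisy, t
```

A model trained to invert this corruption sees earlier chunks at low noise while predicting later chunks at high noise, which is the property the claim about causal temporal modeling rests on.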
If this is right
- Chunk-wise prompting allows controllable generation of video segments.
- Real-time deployment is possible with constant peak inference cost for any video length.
- High temporal consistency emerges naturally from the monotonic noise schedule.
- The model supports context lengths up to 4 million tokens in a 24B parameter setup.
Where Pith is reading between the lines
- Such chunk-based autoregression could extend naturally to interactive video editing by allowing mid-stream prompt changes.
- This method might lower the barrier for training even larger video world models by reducing memory demands during long-sequence inference.
- The streaming capability suggests applications in real-time simulation or augmented reality environments.
Load-bearing premise
Training on monotonically increasing per-chunk noise alone produces sufficient causal temporal modeling and consistency without needing extra architectural constraints or post-processing.
What would settle it
Observe whether long generated videos exhibit frame-to-frame inconsistencies or drift when starting from a single image and text prompt, especially beyond the training chunk lengths.
Original abstract
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MAGI-1, a 24B-parameter autoregressive world model for video generation that predicts fixed-length chunks of consecutive frames sequentially. The model is trained to denoise per-chunk noise that increases monotonically over time, which the authors state enables causal temporal modeling, streaming generation, and high temporal consistency for text-conditioned image-to-video tasks. The model supports chunk-wise prompting for controllability, maintains constant peak inference cost independent of video length, and scales to context lengths of 4 million tokens. Code, models, and the MagiAttention library are released publicly.
Significance. If the performance and causality claims hold under rigorous evaluation, the work would represent a meaningful step toward scalable autoregressive video world models, particularly for long-context streaming generation with memory-efficient inference. The public release of code and models strengthens reproducibility and community impact. However, the absence of any quantitative metrics, baselines, or ablations in the presented material substantially weakens the ability to gauge whether the monotonic noise schedule delivers the claimed advantages over standard diffusion or autoregressive approaches.
major comments (3)
- [Abstract] The central claims of 'strong performance' and 'high temporal consistency' on I2V tasks are asserted without any reported quantitative metrics, baseline comparisons, ablation studies, or error analysis. This omission is load-bearing because the manuscript's value rests on demonstrating that the chunk-wise autoregressive approach with monotonic noise outperforms existing methods.
- [Abstract] Training description (Abstract and implied methods): The claim that denoising per-chunk noise increasing monotonically over time is sufficient to produce causal temporal modeling lacks any mention of causal attention masking, chunk-wise attention restrictions, or ablations isolating the noise schedule's contribution. Without these, bidirectional attention within the transformer could permit future-frame leakage, directly undermining the causality and consistency assertions. A minimal sketch of the kind of block-causal masking at issue appears after this list.
- [Scalability claims] Scalability section: The statement that the 24B model supports 4 million token contexts with constant inference cost is presented without scaling curves, memory profiling, or empirical results on long video sequences. This detail is critical to the scalability narrative and requires concrete evidence to support.
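For reference on the masking question raised in the second comment, here is a minimal sketch of a block-causal attention mask: full attention within a chunk, causal attention across chunks, so tokens never attend to later chunks. The function and token layout are hypothetical; the paper's implementation may differ.

```python
# Illustrative block-causal mask (assumed layout): token i may attend to token j
# iff j belongs to the same chunk as i or to an earlier chunk.
import torch

def block_causal_mask(num_chunks: int, tokens_per_chunk: int) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask; True means attention is allowed."""
    seq_len = num_chunks * tokens_per_chunk
    chunk_id = torch.arange(seq_len) // tokens_per_chunk  # chunk index of each token
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)

# Example: 3 chunks of 4 tokens each yields a 12x12 lower block-triangular mask.
mask = block_causal_mask(num_chunks=3, tokens_per_chunk=4)
```

With such a mask in place, future-frame leakage through bidirectional attention is ruled out by construction, which is why the referee asks for it to be stated explicitly and ablated against the noise schedule alone.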
minor comments (2)
- [Abstract] The abstract refers to 'several algorithmic innovations' and 'a dedicated infrastructure stack' without naming them; these should be enumerated early in the introduction or methods for clarity.
- Consider adding a related-work section that explicitly contrasts the monotonic per-chunk noise schedule against prior autoregressive video models and diffusion-based I2V methods.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below and commit to revisions that add the requested quantitative evidence, clarifications, and empirical support.
Point-by-point responses
-
Referee: [Abstract] The central claims of 'strong performance' and 'high temporal consistency' on I2V tasks are asserted without any reported quantitative metrics, baseline comparisons, ablation studies, or error analysis. This omission is load-bearing because the manuscript's value rests on demonstrating that the chunk-wise autoregressive approach with monotonic noise outperforms existing methods.
Authors: We agree that the abstract would be strengthened by quantitative support. The main body contains benchmark results on I2V tasks, but we will revise the abstract to include key metrics (e.g., FVD and temporal consistency scores) and a brief reference to baseline comparisons and ablations on the noise schedule. revision: yes
-
Referee: [Abstract] Training description (Abstract and implied methods): The claim that denoising per-chunk noise increasing monotonically over time is sufficient to produce causal temporal modeling lacks any mention of causal attention masking, chunk-wise attention restrictions, or ablations isolating the noise schedule's contribution. Without these, bidirectional attention within the transformer could permit future-frame leakage, directly undermining the causality and consistency assertions.
Authors: The monotonic noise schedule is intended to encourage causality through sequential chunk prediction with progressively higher noise on future chunks. To address the concern about potential leakage, we will add an explicit description of the causal attention masking used in the transformer and include ablations that isolate the noise schedule's contribution versus standard uniform noise. revision: yes
-
Referee: [Scalability claims] Scalability section: The statement that the 24B model supports 4 million token contexts with constant inference cost is presented without scaling curves, memory profiling, or empirical results on long video sequences. This detail is critical to the scalability narrative and requires concrete evidence to support.
Authors: We will expand the scalability section to include scaling curves for context length up to 4M tokens, memory usage profiles, and empirical results on long video sequences that demonstrate constant peak inference cost. revision: yes
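As context for the constant-peak-cost discussion above, here is one plausible (assumed) way chunk-wise streaming keeps peak inference cost flat: only a bounded window of previously generated chunks is held as conditioning context, so memory does not grow with total video length. The function names, the bounded-window policy, and the defaults are illustrative; the paper's actual inference stack and KV-cache policy may work differently.

```python
# Illustrative streaming loop (assumed, not MAGI-1's inference code): peak memory is
# bounded because at most `context_chunks` previously generated chunks are retained.
import collections
import torch

def stream_generate(denoise_chunk, first_frame: torch.Tensor,
                    num_chunks: int, chunk_len: int = 24, context_chunks: int = 4):
    """`denoise_chunk(context, noise)` is a stand-in for the model: it maps a list of
    earlier chunks plus a noise seed to the next denoised chunk of `chunk_len` frames."""
    context = collections.deque(maxlen=context_chunks)  # bounded history -> flat peak cost
    context.append(first_frame.unsqueeze(0))            # treat the input image as a 1-frame chunk
    outputs = []
    for _ in range(num_chunks):
        noise = torch.randn(chunk_len, *first_frame.shape)
        chunk = denoise_chunk(list(context), noise)
        outputs.append(chunk)
        context.append(chunk)                            # oldest chunk is evicted once full
    return outputs
```

Whatever the actual mechanism, this is the kind of profiling target the expanded scalability section would need to report: peak memory and latency per chunk as video length grows.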
Circularity Check
No significant circularity; performance claims rest on architectural description
Full rationale
The paper presents MAGI-1 as an autoregressive chunk-wise video generator trained with monotonically increasing per-chunk noise, claiming this enables causal temporal modeling and high consistency. No equations, derivations, or fitted parameters are shown that would make the claimed I2V performance or causality true by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansätze appear in the abstract or description. The claims are instead grounded in external benchmarks and the reported scalability and empirical results.
Axiom & Free-Parameter Ledger
free parameters (2)
- chunk length
- noise increase schedule
axioms (1)
- Domain assumption: Sequential prediction of denoised chunks produces temporally consistent video without additional consistency losses or post-processing.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "Trained to denoise per-chunk noise that increases monotonically over time... block-causal attention mask enforces temporal causality across chunks"
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "MAGI-1 employs full attention within each chunk and causal attention across chunks... 24-frame chunks"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 30 Pith papers
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
PhysInOne: Visual Physics Learning and Reasoning in One Suite
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
-
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
-
Envisioning the Future, One Step at a Time
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
Unified Vector Floorplan Generation via Markup Representation
A single transformer model using a new markup representation generates functional floorplans from diverse conditions and outperforms prior task-specific methods on the RPLAN dataset.
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
-
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
-
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.
-
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
-
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...
-
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
Motion-Aware Caching for Efficient Autoregressive Video Generation
MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
-
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
-
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
-
Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows
WinDiNet repurposes a 2B-parameter video diffusion model as a differentiable surrogate that generates 112-frame urban wind flow rollouts in under one second and enables direct gradient optimization of building positions.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
LongLive: Real-time Interactive Long Video Generation
LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
-
Video Generation with Predictive Latents
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
-
Motion-Aware Caching for Efficient Autoregressive Video Generation
MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...