MAGI-1: Autoregressive Video Generation at Scale
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 20:24 UTC · model grok-4.3
The pith
MAGI-1 generates videos by autoregressively predicting fixed-length chunks of frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAGI-1 is trained to denoise per-chunk noise that increases monotonically over time, enabling it to generate videos autoregressively as sequences of fixed-length frame segments. This produces causal temporal modeling, supports streaming generation, and maintains constant peak inference cost independent of video length. The approach achieves strong performance on image-to-video tasks with text instructions and scales to 24 billion parameters with up to 4 million token contexts.
What carries the argument
Chunk-wise autoregressive prediction where each video chunk is denoised with noise levels increasing monotonically over successive chunks.
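To make the load-bearing mechanism concrete, here is a minimal sketch of how monotonically increasing per-chunk noise could be assigned during training. The linear ramp, interpolation-style corruption, and frames-first latent layout are illustrative assumptions, not the paper's actual training procedure.

```python
# Illustrative sketch (assumed, not MAGI-1's code): assign each fixed-length chunk a
# noise level that increases monotonically with its position in the video, so earlier
# chunks are cleaner and later chunks are noisier within one training window.
import torch

def per_chunk_noise_levels(num_chunks: int, t_min: float = 0.0, t_max: float = 1.0) -> torch.Tensor:
    """Monotonically increasing noise level per chunk (hypothetical linear ramp)."""
    return torch.linspace(t_min, t_max, num_chunks)

def add_chunkwise_noise(latents: torch.Tensor, chunk_len: int = 24):
    """latents: (frames, channels, H, W); frames is assumed divisible by chunk_len."""
    num_chunks = latents.shape[0] // chunk_len
    t = per_chunk_noise_levels(num_chunks)                 # (num_chunks,)
    t = t.repeat_interleave(chunk_len).view(-1, 1, 1, 1)   # one level per frame
    noise = torch.randn_like(latents)
    # Interpolation-style corruption; the paper's exact forward process may differ.
    noisy = (1.0 - t) * latents + t * noise
    return noisy, t
```

A model trained to invert this corruption sees earlier chunks at low noise while predicting later chunks at high noise, which is the property the claim about causal temporal modeling rests on.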
If this is right
- Chunk-wise prompting allows controllable generation of video segments.
- Real-time deployment is possible with constant peak inference cost for any video length.
- High temporal consistency emerges naturally from the monotonic noise schedule.
- The model supports context lengths up to 4 million tokens in a 24B parameter setup.
Where Pith is reading between the lines
- Such chunk-based autoregression could extend naturally to interactive video editing by allowing mid-stream prompt changes.
- This method might lower the barrier for training even larger video world models by reducing memory demands during long-sequence inference.
- The streaming capability suggests applications in real-time simulation or augmented reality environments.
Load-bearing premise
Training on monotonically increasing per-chunk noise alone produces sufficient causal temporal modeling and consistency without needing extra architectural constraints or post-processing.
What would settle it
Observe whether long generated videos exhibit frame-to-frame inconsistencies or drift when starting from a single image and text prompt, especially beyond the training chunk lengths.
Original abstract
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MAGI-1, a 24B-parameter autoregressive world model for video generation that predicts fixed-length chunks of consecutive frames sequentially. The model is trained to denoise per-chunk noise that increases monotonically over time, which the authors state enables causal temporal modeling, streaming generation, and high temporal consistency for text-conditioned image-to-video tasks. The model supports chunk-wise prompting for controllability, maintains constant peak inference cost independent of video length, and scales to context lengths of 4 million tokens. Code, models, and the MagiAttention library are released publicly.
Significance. If the performance and causality claims hold under rigorous evaluation, the work would represent a meaningful step toward scalable autoregressive video world models, particularly for long-context streaming generation with memory-efficient inference. The public release of code and models strengthens reproducibility and community impact. However, the absence of any quantitative metrics, baselines, or ablations in the presented material substantially weakens the ability to gauge whether the monotonic noise schedule delivers the claimed advantages over standard diffusion or autoregressive approaches.
major comments (3)
- [Abstract] The central claims of 'strong performance' and 'high temporal consistency' on I2V tasks are asserted without any reported quantitative metrics, baseline comparisons, ablation studies, or error analysis. This omission is load-bearing because the manuscript's value rests on demonstrating that the chunk-wise autoregressive approach with monotonic noise outperforms existing methods.
- [Abstract] Training description (Abstract and implied methods): The claim that denoising per-chunk noise increasing monotonically over time is sufficient to produce causal temporal modeling lacks any mention of causal attention masking, chunk-wise attention restrictions, or ablations isolating the noise schedule's contribution. Without these, bidirectional attention within the transformer could permit future-frame leakage, directly undermining the causality and consistency assertions. A minimal sketch of the kind of block-causal masking at issue appears after this list.
- [Scalability claims] Scalability section: The statement that the 24B model supports 4 million token contexts with constant inference cost is presented without scaling curves, memory profiling, or empirical results on long video sequences. This detail is critical to the scalability narrative and requires concrete evidence to support.
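For reference on the masking question raised in the second comment, here is a minimal sketch of a block-causal attention mask: full attention within a chunk, causal attention across chunks, so tokens never attend to later chunks. The function and token layout are hypothetical; the paper's implementation may differ.

```python
# Illustrative block-causal mask (assumed layout): token i may attend to token j
# iff j belongs to the same chunk as i or to an earlier chunk.
import torch

def block_causal_mask(num_chunks: int, tokens_per_chunk: int) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask; True means attention is allowed."""
    seq_len = num_chunks * tokens_per_chunk
    chunk_id = torch.arange(seq_len) // tokens_per_chunk  # chunk index of each token
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)

# Example: 3 chunks of 4 tokens each yields a 12x12 lower block-triangular mask.
mask = block_causal_mask(num_chunks=3, tokens_per_chunk=4)
```

With such a mask in place, future-frame leakage through bidirectional attention is ruled out by construction, which is why the referee asks for it to be stated explicitly and ablated against the noise schedule alone.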
minor comments (2)
- [Abstract] The abstract refers to 'several algorithmic innovations' and 'a dedicated infrastructure stack' without naming them; these should be enumerated early in the introduction or methods for clarity.
- Consider adding a related-work section that explicitly contrasts the monotonic per-chunk noise schedule against prior autoregressive video models and diffusion-based I2V methods.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below and commit to revisions that add the requested quantitative evidence, clarifications, and empirical support.
Point-by-point responses
-
Referee: [Abstract] The central claims of 'strong performance' and 'high temporal consistency' on I2V tasks are asserted without any reported quantitative metrics, baseline comparisons, ablation studies, or error analysis. This omission is load-bearing because the manuscript's value rests on demonstrating that the chunk-wise autoregressive approach with monotonic noise outperforms existing methods.
Authors: We agree that the abstract would be strengthened by quantitative support. The main body contains benchmark results on I2V tasks, but we will revise the abstract to include key metrics (e.g., FVD and temporal consistency scores) and a brief reference to baseline comparisons and ablations on the noise schedule. revision: yes
-
Referee: [Abstract] Training description (Abstract and implied methods): The claim that denoising per-chunk noise increasing monotonically over time is sufficient to produce causal temporal modeling lacks any mention of causal attention masking, chunk-wise attention restrictions, or ablations isolating the noise schedule's contribution. Without these, bidirectional attention within the transformer could permit future-frame leakage, directly undermining the causality and consistency assertions.
Authors: The monotonic noise schedule is intended to encourage causality through sequential chunk prediction with progressively higher noise on future chunks. To address the concern about potential leakage, we will add an explicit description of the causal attention masking used in the transformer and include ablations that isolate the noise schedule's contribution versus standard uniform noise. revision: yes
-
Referee: [Scalability claims] Scalability section: The statement that the 24B model supports 4 million token contexts with constant inference cost is presented without scaling curves, memory profiling, or empirical results on long video sequences. This detail is critical to the scalability narrative and requires concrete evidence to support.
Authors: We will expand the scalability section to include scaling curves for context length up to 4M tokens, memory usage profiles, and empirical results on long video sequences that demonstrate constant peak inference cost. revision: yes
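As context for the constant-peak-cost discussion above, here is one plausible (assumed) way chunk-wise streaming keeps peak inference cost flat: only a bounded window of previously generated chunks is held as conditioning context, so memory does not grow with total video length. The function names, the bounded-window policy, and the defaults are illustrative; the paper's actual inference stack and KV-cache policy may work differently.

```python
# Illustrative streaming loop (assumed, not MAGI-1's inference code): peak memory is
# bounded because at most `context_chunks` previously generated chunks are retained.
import collections
import torch

def stream_generate(denoise_chunk, first_frame: torch.Tensor,
                    num_chunks: int, chunk_len: int = 24, context_chunks: int = 4):
    """`denoise_chunk(context, noise)` is a stand-in for the model: it maps a list of
    earlier chunks plus a noise seed to the next denoised chunk of `chunk_len` frames."""
    context = collections.deque(maxlen=context_chunks)  # bounded history -> flat peak cost
    context.append(first_frame.unsqueeze(0))            # treat the input image as a 1-frame chunk
    outputs = []
    for _ in range(num_chunks):
        noise = torch.randn(chunk_len, *first_frame.shape)
        chunk = denoise_chunk(list(context), noise)
        outputs.append(chunk)
        context.append(chunk)                            # oldest chunk is evicted once full
    return outputs
```

Whatever the actual mechanism, this is the kind of profiling target the expanded scalability section would need to report: peak memory and latency per chunk as video length grows.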
Circularity Check
No significant circularity; performance claims rest on architectural description
Full rationale
The paper presents MAGI-1 as an autoregressive chunk-wise video generator trained with monotonically increasing per-chunk noise, claiming this enables causal temporal modeling and high consistency. No equations, derivations, or fitted parameters are shown that would make the claimed I2V performance or causality true by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansätze appear in the abstract or description. The claims are instead grounded in external benchmarks and the reported scalability and empirical results.
Axiom & Free-Parameter Ledger
free parameters (2)
- chunk length
- noise increase schedule
axioms (1)
- Domain assumption: Sequential prediction of denoised chunks produces temporally consistent video without additional consistency losses or post-processing.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "Trained to denoise per-chunk noise that increases monotonically over time... block-causal attention mask enforces temporal causality across chunks"
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "MAGI-1 employs full attention within each chunk and causal attention across chunks... 24-frame chunks"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 30 Pith papers
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
PhysInOne: Visual Physics Learning and Reasoning in One Suite
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
-
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
-
Envisioning the Future, One Step at a Time
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
Unified Vector Floorplan Generation via Markup Representation
A single transformer model using a new markup representation generates functional floorplans from diverse conditions and outperforms prior task-specific methods on the RPLAN dataset.
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
-
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
-
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.
-
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
-
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...
-
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
Motion-Aware Caching for Efficient Autoregressive Video Generation
MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
-
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
-
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
-
Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows
WinDiNet repurposes a 2B-parameter video diffusion model as a differentiable surrogate that generates 112-frame urban wind flow rollouts in under one second and enables direct gradient optimization of building positions.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
LongLive: Real-time Interactive Long Video Generation
LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
-
Video Generation with Predictive Latents
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
-
Motion-Aware Caching for Efficient Autoregressive Video Generation
MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...