SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Enze Xie; Haozhe Liu; Jincheng Yu; Junsong Chen; Qiyuan He; Song Han; Tian Ye; Yicheng Pan; Yuyang Zhao

arxiv: 2605.30409 · v1 · pith:MJKZ7TEEnew · submitted 2026-05-28 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 07:37 UTC · model grok-4.3

keywords: streaming video editing · diffusion transformer · real-time inference · temporal consistency · hybrid architecture · mixed precision quantization · flow matching regularization

The pith

A hybrid diffusion transformer with cycle-reverse regularization and hardware co-design achieves real-time 1280x704 video editing at 24 FPS on one consumer GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SANA-Streaming as a system-algorithm co-design for streaming video-to-video editing, a task that demands both temporal consistency and high inference speed. It combines three pieces: a hybrid transformer architecture that uses softmax attention in a subset of blocks and linear attention in the rest, a cycle-reverse training method that uses flow matching to predict the source frames back from the edited output, and targeted system optimizations including fused kernels and mixed-precision quantization. The goal is to deliver usable performance for live applications without needing paired long edited video datasets. If the designs hold, they yield measurable gains in both coherence and throughput over prior methods. The reported outcome is end-to-end operation at 24 FPS, with the core model at 58 FPS, on a single RTX 5090.

Core claim

The central claim is that a Hybrid Diffusion Transformer using partial softmax attention, trained via Cycle-Reverse Regularization that predicts source frames from generated content through flow matching, and paired with Blackwell-specific fused GDN kernels plus mixed-precision quantization, produces real-time streaming video editing at 1280 x 704 resolution and 24 end-to-end FPS while improving temporal coherence over existing state-of-the-art approaches.

What carries the argument

The Hybrid Diffusion Transformer architecture, which mixes linear attention blocks with selective softmax attention blocks to strengthen local modeling while retaining linear efficiency.
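
To make the efficiency argument concrete, the sketch below lays out a block stack in which every k-th block uses softmax attention and the rest use linear attention, then compares rough attention costs. The block count, interleaving period, token count, head dimension, and cost model are all illustrative assumptions; the abstract does not say which blocks receive softmax attention.

```python
# Illustrative sketch only: block count, interleaving period, and the
# cost model are assumptions, not values taken from the paper.

def hybrid_layout(num_blocks: int, softmax_every: int) -> list[str]:
    """Per-block attention type; every `softmax_every`-th block is softmax."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "linear"
        for i in range(num_blocks)
    ]


def attention_cost(kind: str, n_tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for one attention layer.

    Softmax attention forms an n x n score matrix: about 2 * n^2 * d.
    Linear attention accumulates a d x d state:    about 2 * n * d^2.
    """
    if kind == "softmax":
        return 2 * n_tokens * n_tokens * dim
    if kind == "linear":
        return 2 * n_tokens * dim * dim
    raise ValueError(f"unknown attention kind: {kind}")


layout = hybrid_layout(num_blocks=8, softmax_every=4)
n_tokens, dim = 4096, 64  # hypothetical sequence length and head dimension
hybrid = sum(attention_cost(kind, n_tokens, dim) for kind in layout)
full_softmax = len(layout) * attention_cost("softmax", n_tokens, dim)

print(layout)
print(f"hybrid / full-softmax attention cost: {hybrid / full_softmax:.3f}")
```

Under these toy numbers the linear blocks are nearly free compared with the softmax blocks, so the total cost is set almost entirely by how many softmax blocks are kept. That is the lever a hybrid design trades against local modeling quality.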

If this is right

Real-time 1280 x 704 editing reaches 24 end-to-end FPS with the diffusion transformer core at 58 FPS on a single RTX 5090 GPU.
Temporal consistency improves without access to paired long edited video datasets.
Mixed-precision quantization and fused kernels raise throughput while preserving generation quality.
The full co-design outperforms prior methods on both coherence and system speed metrics.
Interactive applications such as live broadcasting become feasible on consumer hardware.
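
The two throughput figures imply a per-frame time budget, worked out below. The split is a rough reading that assumes the pipeline stages run serially for each frame, which the abstract does not state; if stages overlap, the remainder is not a simple subtraction.

```python
# Frame-time budget implied by the reported throughput figures.
# Assumption (not stated in the paper): stages run serially per frame,
# so non-DiT time is the remainder of the end-to-end budget.

END_TO_END_FPS = 24.0   # reported end-to-end rate
DIT_CORE_FPS = 58.0     # reported DiT-only rate

frame_budget_ms = 1000.0 / END_TO_END_FPS   # time available per output frame
dit_ms = 1000.0 / DIT_CORE_FPS              # time the DiT core needs per frame
other_ms = frame_budget_ms - dit_ms         # left for everything outside the DiT

print(f"per-frame budget: {frame_budget_ms:.2f} ms")
print(f"DiT core:         {dit_ms:.2f} ms")
print(f"everything else:  {other_ms:.2f} ms")
```

On this reading, roughly 41.7 ms is available per frame, the DiT core uses about 17.2 ms of it, and about 24.4 ms goes to the rest of the pipeline.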

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hybrid attention pattern and cycle training could reduce data needs for other generative video tasks that lack paired long sequences.
If the regularization generalizes across input lengths, it may lower the barrier for deploying streaming models on varied consumer GPUs.
Hardware-specific quantization choices may transfer to similar diffusion backbones, suggesting a template for co-design in other real-time generation settings.
Success here would encourage testing whether selective softmax blocks improve local fidelity in non-editing video diffusion pipelines.

Load-bearing premise

The cycle-reverse regularization produces stable temporal coherence on real streaming inputs without paired long edited videos for training.
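
Since the reviewed text gives no loss formulation, the following is only a guess at the shape such a term could take: a standard flow-matching regression whose target is the source frame and whose conditioning signal is the edited frame. The linear interpolation path, the conditioning scheme, and `toy_velocity_model` are all assumptions made for illustration.

```python
import random

# Hypothetical sketch: the abstract gives no loss formulation, so the
# interpolation path, conditioning, and toy model below are assumptions.

def flow_matching_pair(noise, target, t):
    """Linear path x_t = (1 - t) * noise + t * target, velocity = target - noise."""
    x_t = [(1.0 - t) * n + t * y for n, y in zip(noise, target)]
    velocity = [y - n for n, y in zip(noise, target)]
    return x_t, velocity


def cycle_reverse_loss(velocity_model, edited_frame, source_frame, rng):
    """MSE between predicted and true velocity toward the SOURCE frame.

    The model is conditioned on the edited frame, so a low loss means the
    edit kept enough information to recover the source.
    """
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in source_frame]
    x_t, target_v = flow_matching_pair(noise, source_frame, t)
    pred_v = velocity_model(x_t, t, edited_frame)
    return sum((p - v) ** 2 for p, v in zip(pred_v, target_v)) / len(target_v)


def toy_velocity_model(x_t, t, condition):
    """Stand-in predictor that assumes the source equals the condition.

    Solving x_t = (1 - t) * noise + t * source for (source - noise)
    gives (source - x_t) / (1 - t).
    """
    return [(c - x) / (1.0 - t) for c, x in zip(condition, x_t)]


rng = random.Random(0)
source = [0.2, 0.5, 0.9, 0.1]               # flattened toy "source frame"
faithful_edit = list(source)                # edit that preserves the source
drifted_edit = [v + 0.5 for v in source]    # edit that has drifted away

faithful_loss = cycle_reverse_loss(toy_velocity_model, faithful_edit, source, rng)
drifted_loss = cycle_reverse_loss(toy_velocity_model, drifted_edit, source, rng)
print(faithful_loss, drifted_loss)
```

In this toy setup the loss is near zero when the edited frame still determines the source and grows when it does not. That is the kind of signal that could discourage drift without needing paired edited videos. Whether the paper's objective behaves this way on real streams is exactly what the premise above leaves open.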

What would settle it

Running the system on extended unedited streaming video sequences and measuring whether temporal coherence metrics drop below those of prior methods when the cycle-reverse term is removed.
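
A comparison of this kind needs a concrete coherence number. The reviewed text does not define the paper's metric, so the sketch below uses mean absolute frame-to-frame change as an assumed proxy. The two toy clips are made-up data and not results from any model.

```python
# Assumed proxy metric: the paper's temporal-coherence metric is not defined
# in the text reviewed here, so mean absolute frame-to-frame change stands in.

def temporal_flicker(frames):
    """Average absolute change between consecutive frames (lower = steadier)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        total += sum(abs(c - p) for p, c in zip(prev, curr)) / len(curr)
    return total / (len(frames) - 1)


# Toy 3-pixel "videos" (made-up data): one changes smoothly, one flickers.
steady_clip = [[0.10, 0.20, 0.30], [0.11, 0.21, 0.31], [0.12, 0.22, 0.32]]
flickering_clip = [[0.10, 0.20, 0.30], [0.30, 0.05, 0.45], [0.08, 0.25, 0.28]]

steady_score = temporal_flicker(steady_clip)
flicker_score = temporal_flicker(flickering_clip)
print(steady_score, flicker_score)
```

A raw frame difference also penalizes legitimate motion, so a real evaluation would compare against the motion in the source video or use a motion-compensated measure. The proxy only illustrates the with-term versus without-term comparison the test calls for.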

Original abstract

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

SANA-Streaming combines a hybrid DiT, cycle-reverse flow-matching regularization, and Blackwell-specific MPQ to claim 24 FPS real-time 1280x704 V2V editing on one RTX 5090.

This paper describes a working system for real-time streaming video-to-video editing. It reports 24 end-to-end FPS at 1280x704 on a single RTX 5090, with the DiT core at 58 FPS.

The three pieces that look new are the hybrid Diffusion Transformer (softmax attention in some blocks, linear layers elsewhere), the cycle-reverse regularization that uses flow matching to predict source frames from generated content for temporal consistency, and the mixed-precision quantization plus fused kernels tuned specifically for the Blackwell architecture. The training signal avoids the need for paired long edited videos, which is a practical advantage.
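
As a rough picture of what a mixed-precision assignment involves, the sketch below picks, per layer, the narrowest bit-width whose quantization error stays under a tolerance. Everything here is an assumption: integer grids stand in for whatever low-precision formats the paper uses, the weights and tolerance are made up, and the paper's search is described as driven by profiled throughput, which this toy does not model.

```python
# Sketch under assumptions: integer bit-widths, the error tolerance, and the
# toy layer weights are hypothetical; throughput profiling is not modeled.

def quantize(weights, bits):
    """Symmetric uniform quantize-dequantize onto a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        return list(weights)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def relative_error(weights, bits):
    """Root-mean-square quantization error relative to the weight RMS."""
    dequantized = quantize(weights, bits)
    err = sum((w - d) ** 2 for w, d in zip(weights, dequantized))
    ref = sum(w * w for w in weights)
    return (err / ref) ** 0.5 if ref else 0.0


def pick_precision(layers, candidates=(4, 8, 16), tolerance=0.1):
    """Per layer, choose the narrowest candidate whose error is within tolerance."""
    plan = {}
    for name, weights in layers.items():
        plan[name] = next(
            (b for b in sorted(candidates) if relative_error(weights, b) <= tolerance),
            max(candidates),
        )
    return plan


layers = {
    "smooth_layer": [0.5, -0.4, 0.3, -0.2, 0.1, 0.45, -0.35, 0.25],
    "heavy_tailed_layer": [1.0, 0.06, -0.05, 0.07, -0.06, 0.05, -0.07, 0.06],
}
plan = pick_precision(layers)
print(plan)
```

The evenly spread layer tolerates 4 bits, while the layer with one large value and many small ones needs 8, because a coarse grid scaled to the largest weight rounds the small ones to zero. A throughput-aware search would weigh that error against measured kernel speed for each format.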

The paper does well by tying architecture, training, and hardware choices into one co-design and by stating concrete, falsifiable speed and resolution targets. The cycle-reverse idea directly addresses a data bottleneck that many video editing methods face.

The main soft spot is that the provided abstract gives no tables, ablations, or baseline details, so the claim of outperforming SOTA in both coherence and throughput cannot be checked from the summary alone. If the full paper supplies those comparisons with proper controls, the concern shrinks. No circularity or hidden fitting shows up in the description.

This is for CV researchers working on efficient diffusion models for interactive video applications. A reader who needs real-time performance on consumer GPUs would get concrete techniques to examine.

It deserves peer review because the contributions are specific enough to test and the performance numbers are stated plainly.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SANA-Streaming, a system-algorithm co-designed framework for real-time streaming video-to-video editing. Core contributions include (1) a hybrid Diffusion Transformer that mixes softmax attention blocks with linear layers for improved local modeling, (2) cycle-reverse regularization that uses flow matching to predict source frames from generated content for semantic consistency without paired long edited videos, and (3) Blackwell-specific fused GDN kernels and mixed-precision quantization (MPQ) to maximize Tensor Core utilization. The paper claims the resulting system achieves 24 end-to-end FPS at 1280×704 resolution on a single RTX 5090, with the DiT core at 58 FPS, and significantly outperforms existing SOTA methods in temporal coherence and throughput.

Significance. If the performance numbers and outperformance claims hold under rigorous evaluation, the work would be significant for enabling interactive real-time V2V editing on consumer GPUs. The hybrid architecture, the cycle-reverse training signal that sidesteps the need for paired long videos, and the hardware-specific MPQ co-design represent concrete advances. The explicit, falsifiable targets (24 FPS end-to-end, 58 FPS DiT core at the stated resolution) and the absence of hidden parameters in the core claims are strengths that make the results directly testable.

major comments (2)

Abstract and Experimental Results section: the central claims of 24 end-to-end FPS, 58 FPS DiT core, and significant outperformance over SOTA in temporal coherence and throughput are stated without any quantitative tables, error bars, baseline details, ablation studies, or metric definitions visible in the provided text. This absence makes the load-bearing performance assertions impossible to verify from the manuscript as presented.
Core design 2 (cycle-reverse regularization): the description states that the strategy enforces semantic consistency by predicting source frames from generated content via flow matching without paired long edited videos, but no equations, loss formulation, or training details are supplied to show how this signal is implemented or stabilized; if the regularization fails to produce stable coherence on real streaming inputs, the claimed real-time advantage would not hold.
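
The first comment's request for error bars is cheap to satisfy. A generic harness like the one below reports mean and standard deviation of FPS over several timed runs after a warm-up. `process_frame` and `fake_frame` are placeholders rather than anything from the paper, and the run counts are arbitrary.

```python
import statistics
import time

# Generic timing harness; `process_frame` is a placeholder for whatever
# per-frame work is being measured, not an API from the paper.

def measure_fps(process_frame, n_frames=50, warmup=5, runs=3):
    """Return (mean_fps, stdev_fps) across `runs` independent timed runs."""
    for _ in range(warmup):           # keep one-time setup cost out of the timing
        process_frame()
    fps_per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(n_frames):
            process_frame()
        elapsed = time.perf_counter() - start
        fps_per_run.append(n_frames / elapsed)
    return statistics.mean(fps_per_run), statistics.stdev(fps_per_run)


def fake_frame():
    """Stand-in workload: about 2 ms of sleep per call."""
    time.sleep(0.002)


mean_fps, std_fps = measure_fps(fake_frame)
print(f"{mean_fps:.1f} ± {std_fps:.1f} FPS")
```

For GPU workloads the device must be synchronized before each clock read, since kernel launches return before the work finishes. End-to-end and model-only rates should also be timed separately, as the paper's two figures imply.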

minor comments (2)

Abstract: the resolution is given as '1280 x 704' without stating whether this is width × height or whether the same resolution and aspect ratio are used in all experiments.
Abstract: 'Blackwell (RTX 5090)' should note that the RTX 5090 is a consumer Blackwell part; any architecture-specific claims should be scoped accordingly.
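
The resolution question can be partly answered by arithmetic. The snippet assumes 1280 is the width and 704 the height, which is the natural reading but is the very point the first minor comment asks the authors to confirm.

```python
from math import gcd

# Plain arithmetic on the reported resolution, assuming 1280 is the width
# and 704 the height (the reading the referee asks the authors to confirm).
width, height = 1280, 704

g = gcd(width, height)
print(f"aspect ratio: {width // g}:{height // g} = {width / height:.4f}")
print(f"16:9 for comparison: {16 / 9:.4f}")
print(f"pixels per frame: {width * height:,}")
print(f"both sides divisible by 64: {width % 64 == 0 and height % 64 == 0}")
```

The ratio reduces to 20:11, slightly wider than 16:9, so 1280 x 704 is not standard 720p. Both sides are multiples of 64, which may be a convenience for a downsampling autoencoder or patch grid, though the reviewed text does not give the reason.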

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the manuscript accordingly to improve clarity, verifiability, and completeness.

Referee: Abstract and Experimental Results section: the central claims of 24 end-to-end FPS, 58 FPS DiT core, and significant outperformance over SOTA in temporal coherence and throughput are stated without any quantitative tables, error bars, baseline details, ablation studies, or metric definitions visible in the provided text. This absence makes the load-bearing performance assertions impossible to verify from the manuscript as presented.

Authors: We agree that the performance claims require explicit supporting data for verification. The complete manuscript contains an Experimental Results section with quantitative tables for FPS measurements and SOTA comparisons, ablation studies, error bars from multiple runs, baseline details, and metric definitions (including temporal coherence). In the revision we will add explicit cross-references from the abstract to these tables, include a compact summary table of key metrics near the introduction, and ensure all numerical claims are directly traceable to visible experimental results.
revision: yes

Referee: Core design 2 (cycle-reverse regularization): the description states that the strategy enforces semantic consistency by predicting source frames from generated content via flow matching without paired long edited videos, but no equations, loss formulation, or training details are supplied to show how this signal is implemented or stabilized; if the regularization fails to produce stable coherence on real streaming inputs, the claimed real-time advantage would not hold.

Authors: We agree that the current description of cycle-reverse regularization is insufficiently detailed. In the revised manuscript we will supply the full mathematical formulation, including the flow-matching loss for predicting source frames from generated content, the overall training objective, implementation specifics, and any stabilization methods used. This will enable assessment of the regularization's stability and its contribution to the reported real-time performance.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent architectural and empirical choices

The paper's core claims rest on three explicit design decisions (hybrid DiT blocks with partial softmax attention, cycle-reverse flow-matching regularization, and Blackwell-specific MPQ + fused kernels) whose performance is reported via direct hardware measurements (24 end-to-end FPS at 1280×704, DiT core at 58 FPS). No equations, fitted parameters, or self-citations are shown that would reduce these metrics or the temporal-coherence improvement to quantities defined by the same inputs. The training signal is described as avoiding paired long videos, and the throughput numbers are presented as profiled results rather than derived predictions. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated beyond standard diffusion model assumptions.

pith-pipeline@v0.9.1-grok · 5793 in / 1167 out tokens · 22263 ms · 2026-06-29T07:37:07.574828+00:00 · methodology


[1] [1]

Videox-fun: A video generation pipeline for diffusion transformer, 2026

aigc apps. Videox-fun: A video generation pipeline for diffusion transformer, 2026

2026

[2] [2]

Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

work page arXiv 2025

[3] [3]

Two deterministic half- quadratic regularization algorithms for computed imaging

Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half- quadratic regularization algorithms for computed imaging. InProceedings of 1st international conference on image processing, volume 2, pages 168–172. IEEE, 1994

1994

[4] [4]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

2024

[5] [5]

Sana-sprint: One-step diffusion with continuous-time consistency distillation

Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. Sana-sprint: One-step diffusion with continuous-time consistency distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16185–16195, 2025

2025

[6] [6]

Sana-video: Efficient video generation with block linear diffusion transformer

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. Sana-video: Efficient video generation with block linear diffusion transformer.arXiv preprint arXiv:2509.24695, 2025

work page arXiv 2025

[7] [7]

Longlive-2.0: An nvfp4 parallel infrastructure for long video generation, 2026

Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, and Song Han. Longlive-2.0: An nvfp4 parallel infrastructure for long video generation, 2026

2026

[8] [8]

Lol: Longer than longer, scaling video generation to hour.arXiv preprint arXiv:2601.16914, 2026

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Lol: Longer than longer, scaling video generation to hour.arXiv preprint arXiv:2601.16914, 2026

work page arXiv 2026

[9] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.

[10] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022.

[11] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.

[12] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.

[13] Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026.

[14] Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, and Lei Xie. Openve-3m: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826, 2025.

[15] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

[16] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.

[17] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025.

[18] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.

[19] Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, and Guosheng Lin. In-context learning with unpaired clips for instruction-based video editing. arXiv preprint arXiv:2510.14648, 2025.

[20] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026.

[21] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International conference on machine learning, pages 9355–9366. PMLR, 2021.

[22] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.

[23] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.

[24] Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, and Hao Li. Omni-video: Democratizing unified video understanding and generation. arXiv preprint arXiv:2507.06119, 2025.

[25] DecartAI Team. Lucy edit: Open-weight text-guided video editing. 2025.

[26] Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025.

[27] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[28] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[29] Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16692–16701, 2025.

[30] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[31] Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024.

[32] Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025.

[33] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[34] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025.

[35] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.

[36] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.

[37] Tian Ye, Song Fei, and Lei Zhu. Ultraflux: Data-model co-design for high-quality native 4k text-to-image generation across diverse aspect ratios. arXiv preprint arXiv:2511.18050, 2025.

[38] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024.

[39] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024.

[40] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.

[41] Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23464–23473, 2025.

[42] Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, et al. Sla: Beyond sparsity in diffusion transformers via fine-tunable sparse-linear attention. arXiv preprint arXiv:2509.24006, 2025.

[43] Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, and Enze Xie. Sana-wm: Efficient minute-scale world modeling with hybrid linear diffusion transformer. arXiv preprint arXiv:2605.15178, 2026.

[44] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026.

[45] Ya Zou, Jingfeng Yao, Siyuan Yu, Shuai Zhang, Wenyu Liu, and Xinggang Wang. Turbo-vaed: Fast and stable transfer of video-vaes to mobile devices. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026.

A. Mixed-Precision Quantization Search setup and met...