Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Bohan Zhuang; Hanfeng Lu; Inferix Team: Tianyu Feng; Jiahao He; Jiasheng Tang; Jichao Wu; Mingyang Yang; Teng Liu; Wei Wang; Xi Lin

arxiv: 2511.20714 · v2 · submitted 2025-11-25 · 💻 cs.CV · cs.AI

Inferix Team: Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang

Pith reviewed 2026-05-17 05:10 UTC · model grok-4.3

keywords: world models · block diffusion · semi-autoregressive decoding · video generation · inference engine · world simulation · KV cache · LV-Bench

The pith

Inferix is a specialized inference engine that uses block-diffusion to generate long coherent videos for world simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Inferix as a next-generation inference engine built specifically for world models that produce long, physically realistic, and interactive videos. It relies on a semi-autoregressive block-diffusion decoding process that generates video in blocks while conditioning each block on prior ones and reintroduces LLM-style KV cache management to support variable-length outputs. This setup is claimed to yield more coherent sequences than standard video diffusion models and to enable efficient real-time interaction through added streaming and profiling features. The system also integrates a new benchmark, LV-Bench, for evaluating minute-long video generation.

Core claim

Inferix is designed as a next-generation inference engine that enables immersive world synthesis through optimized semi-autoregressive decoding. Its dedicated focus on world simulation sets it apart from systems built for high-concurrency serving and from classic video diffusion models: it merges the strengths of diffusion and autoregressive generation and reintroduces KV cache management for efficient, variable-length, high-quality generation.

What carries the argument

The semi-autoregressive (block-diffusion) decoding paradigm, which generates video tokens in blocks by applying diffusion within each block while conditioning on previous blocks and uses LLM-style KV cache management.
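The control flow described above can be illustrated with a toy sketch. It is not Inferix's implementation: `toy_denoise`, `generate_blocks`, and the list standing in for a KV cache are all hypothetical names, and the "denoiser" is a simple interpolation rather than a model call. The sketch only shows the shape of the loop: several denoising steps inside a block, conditioning on cached earlier blocks, and a cache that grows once per finished block.

```python
import random
from typing import List, Tuple


def toy_denoise(noisy: List[float], context: List[float],
                step: int, num_steps: int) -> List[float]:
    """Hypothetical stand-in for one diffusion step (not a real model).

    Pulls each value toward a target derived from the cached context,
    so every block continues from the block before it.
    """
    target = context[-1] + 1.0 if context else 0.0
    weight = 1.0 / (num_steps - step)  # equals 1.0 on the final step
    return [(1.0 - weight) * x + weight * target for x in noisy]


def generate_blocks(num_blocks: int, block_size: int,
                    num_steps: int, seed: int = 0) -> Tuple[List[float], List[float]]:
    rng = random.Random(seed)
    kv_cache: List[float] = []  # stands in for cached keys/values of finished blocks
    video: List[float] = []
    for _ in range(num_blocks):
        # Each block starts from noise and is denoised over several steps,
        # reading the cache but never rewriting earlier blocks.
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        for step in range(num_steps):
            block = toy_denoise(block, kv_cache, step, num_steps)
        kv_cache.extend(block)  # the cache grows only when a block is finalized
        video.extend(block)
    return video, kv_cache


video, cache = generate_blocks(num_blocks=3, block_size=2, num_steps=4)
print(len(video), len(cache))
```

Because the number of blocks is just a loop bound, output length is not fixed in advance, which is the variable-length property the paper attributes to this paradigm.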

If this is right

World models gain the ability to produce longer, more stable video sequences for agentic AI, embodied AI, and gaming.
Real-time interaction with simulated environments becomes practical via interactive video streaming and profiling.
Minute-long video generation can be benchmarked consistently through seamless LV-Bench integration.
Scaling the models may unlock emergent capabilities in visual perception, understanding, and reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid block-based decoding could become a standard approach for extending video generation beyond fixed-length limits.
The engine's design suggests direct applicability to training agents in dynamic simulated worlds.
Profiling features may help identify bottlenecks in modeling physical dynamics over extended time horizons.

Load-bearing premise

The semi-autoregressive block-diffusion decoding paradigm overcomes the limitations of standard video diffusion through KV cache management, enabling efficient variable-length generation.
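One way to make this premise concrete is a back-of-the-envelope count of attention query-key pairs. The sketch below is an illustrative cost model under stated assumptions, not a measurement from the paper: it assumes both approaches use the same number of denoising steps, that cached blocks contribute keys but no queries, and it ignores projection, feed-forward, and memory costs. The function names are hypothetical.

```python
def full_sequence_pairs(num_tokens: int, num_steps: int) -> int:
    """Query-key pairs when every denoising step attends over the whole sequence."""
    return num_steps * num_tokens * num_tokens


def block_cached_pairs(num_blocks: int, block_size: int, num_steps: int) -> int:
    """Query-key pairs when each block attends to cached earlier blocks plus itself."""
    total = 0
    for k in range(1, num_blocks + 1):
        keys_visible = k * block_size  # cached prefix plus the current block
        total += num_steps * block_size * keys_visible
    return total


# Example: 4 blocks of 10 tokens, 5 denoising steps each.
full = full_sequence_pairs(num_tokens=40, num_steps=5)
blocked = block_cached_pairs(num_blocks=4, block_size=10, num_steps=5)
print(full, blocked, blocked / full)
```

Under these assumptions the blocked variant does roughly half the attention work for long sequences, since the sum over blocks grows as B(B+1)/2 rather than B squared. Whether that translates into real latency gains is exactly what the review says the paper does not yet show.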

What would settle it

An experiment that generates minute-long videos with a standard diffusion model lacking block structure and KV cache and measures equivalent or superior coherence, stability, and speed would falsify the claimed advantage.

Original abstract

World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Inferix is a system paper on a block-diffusion inference engine for world simulation whose efficiency claims rest on an unelaborated KV-cache analogy without mechanism or numbers.

Inferix is a system paper introducing an inference engine built on block-diffusion for generating videos in world models. The main takeaway is that it optimizes semi-autoregressive processes for immersive, interactive simulation, setting it apart from general-purpose engines.

What the paper does well is articulate the advantages of the block-diffusion paradigm for coherent long sequences. It merges diffusion steps within blocks while conditioning on prior blocks, and it points to LLM-style KV caching as the key to efficiency and variable length. The addition of interactive video streaming, profiling tools, and LV-Bench integration for minute-long video evaluation shows attention to real-world use in agentic AI and gaming. The authors earn credit for focusing on this specialized application rather than broad serving scenarios.

The soft spots lie in the supporting evidence. The central claim about overcoming standard video diffusion limits through KV cache management is presented without the underlying mechanism, such as how caching interacts with the noise schedule, or any quantitative results on latency or quality. This matches the stress-test note that the efficiency story rests on an unelaborated analogy. The paper stays descriptive, with design goals but no derivations, comparisons, or error analysis visible.

This work is for practitioners developing world simulators or long-form video systems. A reader interested in engineering optimizations for these models would get value from the described architecture and benchmark. The thinking is clear and engages directly with existing paradigms in diffusion and autoregression, so it qualifies as serious. I recommend sending it for peer review to allow referees to examine the full details and any hidden experiments.

Referee Report

2 major / 1 minor

Summary. The paper presents Inferix, a block-diffusion (semi-autoregressive) inference engine tailored for world simulation and immersive video synthesis. It claims that this paradigm merges diffusion and autoregressive strengths to produce coherent long videos, overcomes standard video diffusion limitations by reintroducing LLM-style KV-cache management for efficient variable-length generation, and distinguishes itself from high-concurrency engines (vLLM, SGLang) and classic video diffusion systems (xDiTs). Additional features include interactive streaming, profiling, and integration with the new LV-Bench benchmark for minute-long video evaluation.

Significance. If the efficiency and quality claims hold under rigorous testing, Inferix could meaningfully advance practical deployment of world models for agentic and embodied AI by enabling longer, more stable video generation at interactive rates. The dedicated focus on simulation rather than generic inference is a clear positioning strength, though the current manuscript provides no empirical grounding to assess whether these advantages materialize.

major comments (2)

[Abstract] Abstract: The central claim that block-diffusion 'reintroduces LLM-style KV Cache management' to achieve efficient, variable-length generation is stated declaratively but supplies neither the attention-mask formulation, caching pseudocode, nor any latency/throughput measurements under a diffusion noise schedule. This mechanism is load-bearing for the asserted superiority over standard video diffusion and high-concurrency engines.
[Abstract] Abstract: No derivations, ablation studies, or quantitative comparisons (e.g., against xDiTs or autoregressive baselines) are provided to support the coherence, stability, or efficiency advantages of the semi-autoregressive block-diffusion approach, rendering the design goals unevaluable from the given text.
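For readers unfamiliar with what an attention-mask formulation would look like here, the following is a minimal sketch of a block-causal mask, a common way to express block-wise autoregressive attention. It is an assumption that Inferix uses a mask of this shape; the function name `block_causal_mask` is hypothetical and nothing below is taken from the manuscript.

```python
from typing import List


def block_causal_mask(num_blocks: int, block_size: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens attend bidirectionally inside their own block and to every
    token of earlier blocks, but never to later blocks.
    """
    n = num_blocks * block_size
    return [
        [(k // block_size) <= (q // block_size) for k in range(n)]
        for q in range(n)
    ]


# Two blocks of two tokens: rows are queries, columns are keys.
for row in block_causal_mask(num_blocks=2, block_size=2):
    print("".join("x" if allowed else "." for allowed in row))
```

Under such a mask, keys and values of finished blocks never depend on later blocks, which is the property that makes caching them possible in principle. How the cached entries relate to the noise level of the block being denoised is the detail the referee asks the authors to specify.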

minor comments (1)

The manuscript would benefit from explicit section headings and a methods or system-architecture subsection that details the block-diffusion implementation, even at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments point by point below and have revised the manuscript to incorporate the requested technical details and supporting analyses.

Referee: [Abstract] Abstract: The central claim that block-diffusion 'reintroduces LLM-style KV Cache management' to achieve efficient, variable-length generation is stated declaratively but supplies neither the attention-mask formulation, caching pseudocode, nor any latency/throughput measurements under a diffusion noise schedule. This mechanism is load-bearing for the asserted superiority over standard video diffusion and high-concurrency engines.

Authors: We agree that the abstract, as a concise summary, does not contain the full technical specifications. The attention-mask formulation and LLM-style KV-cache adaptation for block-diffusion under diffusion noise schedules are detailed in Section 3 of the manuscript, with pseudocode in Algorithm 1. We have revised the abstract to reference these sections explicitly. Latency and throughput measurements under varying noise schedules have been added to the experimental evaluation section, including direct comparisons demonstrating efficiency advantages.
Revision: yes

Referee: [Abstract] Abstract: No derivations, ablation studies, or quantitative comparisons (e.g., against xDiTs or autoregressive baselines) are provided to support the coherence, stability, or efficiency advantages of the semi-autoregressive block-diffusion approach, rendering the design goals unevaluable from the given text.

Authors: We acknowledge that the provided manuscript text focuses on system overview and does not include these supporting elements. In the revision, we have added derivations of the semi-autoregressive block-diffusion paradigm in Appendix A. Ablation studies on block size, conditioning, and coherence metrics are now in Section 4, along with quantitative comparisons of stability and efficiency against xDiTs and autoregressive baselines, evaluated using LV-Bench for minute-long videos.
Revision: yes

Circularity Check

0 steps flagged

No circularity; declarative positioning without equations or self-referential reductions

The abstract and system description present Inferix as a purpose-built engine for world simulation via semi-autoregressive block-diffusion, with claims about KV-cache reintroduction and efficiency stated directly as design advantages rather than derived from any internal equations, fitted parameters, or prior self-citations. No mathematical steps, uniqueness theorems, or ansatzes are shown that reduce the central efficiency assertion back to its own inputs by construction. The text is self-contained in its declarative framing and does not invoke load-bearing self-references or rename known results as novel derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the text is a high-level system description without mathematical derivations or postulated mechanisms.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "block-diffusion ... reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction · unclear
Paper passage: "Figure 1 Architecture comparison ... Block Diffusion combines the strengths of both AR and Diffusion"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization
cs.CV 2026-03 unverdicted novelty 6.0

A dual-tower 4D embodied world model called RoboStereo reduces geometric hallucinations and delivers over 97% relative improvement on manipulation tasks via test-time augmentation, imitative learning, and open exploration.


[1] [1]

Dax: Diffusion accelerated execution.https://github.com/RiseAI-Sys/DAX, 2025

work page 2025

[2] [2]

Mixture of contexts for long video generation.arXiv preprint arXiv:2508.21058,

Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, et al. Mixture of contexts for long video generation.arXiv preprint arXiv:2508.21058, 2025

work page arXiv 2025

[3] [3]

Sharegpt4v: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024

work page 2024

[4] [4]

Mean absolute percentage error for regression models.Neurocomputing, 192:38–48, 2016

Arnaud De Myttenaere, Boris Golden, Bénédicte Le Grand, and Fabrice Rossi. Mean absolute percentage error for regression models.Neurocomputing, 192:38–48, 2016

work page 2016

[5] [5]

xdit: an inference engine for diffusion transformers (dits) with massive parallelism.arXiv preprint arXiv:2411.01738, 2024

Jiarui Fang, Jinzhe Pan, Xibo Sun, Aoyu Li, and Jiannan Wang. xdit: an inference engine for diffusion transformers (dits) with massive parallelism.arXiv preprint arXiv:2411.01738, 2024

work page arXiv 2024

[6] [6]

Pipefusion: Patch-level pipeline parallelism for diffusion transformers inference

Jiarui Fang, Jinzhe Pan, Jiannan Wang, Aoyu Li, and Xibo Sun. Pipefusion: Patch-level pipeline parallelism for diffusion transformers inference. InAdvances in Neural Information Processing Systems, 2025

work page 2025

[7] [7]

Usp: A unified sequence parallelism approach for long context generative ai, 2024

Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai, 2024

work page 2024

[8] Youping Gu, Xiaolong Li, Yuhao Hu, Minqi Chen, and Bohan Zhuang. Blade: Block-sparse attention meets step distillation for efficient video generation. arXiv preprint arXiv:2508.10774, 2025

[9] Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. arXiv preprint arXiv:2503.10589, 2025

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al

[10] Wenkang Han, Wang Lin, Yiyun Zhou, Qi Liu, Shulei Wang, Chang Yao, and Jingyuan Chen. Show and polish: reference-guided identity preservation in face video restoration. arXiv preprint arXiv:2507.10293, 2025

[11] Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, and Yiming Liu. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, 2025

[12] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE transactions on pattern analysis and machine intelligence, 43(5):1562–1577, 2019

[13] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

[14] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion, 2025

[15] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

[16] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. System optimizations for enabling training of extreme long sequence transformer models. In Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, PODC ’24, page 121–130, New York, NY, USA, ...

[17] Sungil Kim and Heeyoung Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32(3):669–679, 2016

[18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[19] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. Infinigen: Efficient generative inference of large language models with dynamic kv cache management, 2024

[20] Muyang Li*, Yujun Lin*, Zhekai Zhang*, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. In The Thirteenth International Conference on Learning Representations, 2025

[21] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation, 2024

[22] Zhuoling Li, Hossein Rahmani, Qiuhong Ke, and Jun Liu. Longdiff: Training-free long video generation in one go. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17789–17798, 2025

[23] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[24] Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model, 2025

[25] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ringattention with blockwise transformers for near-infinite context. In The Twelfth International Conference on Learning Representations, 2024

[26] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

[27] Yu Lu, Yuanzhi Liang, Linchao Zhu, and Yi Yang. Freelong: Training-free long video generation with spectralblend temporal attention. Advances in Neural Information Processing Systems, 37:131434–131455, 2024

[28] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

[29] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, page 1279–1297. ACM, March 2025

[30] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu, 2023

[31] Peize Sun, Jinkun Cao, Yi Jiang, Zehuan Yuan, Song Bai, Kris Kitani, and Ping Luo. Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20993–21002, 2022

[32] The FastVideo Team. Fastvideo: A unified framework for accelerated video generation, April 2024

[33] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

[34] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[35] Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024

LAVIE: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision, 2024a

[36] Qi Xie, Yongjia Ma, Donglin Di, Xuehao Gao, and Xun Yang. Moca: Identity-preserving text-to-video generation via mixture of cross attention. arXiv preprint arXiv:2508.03034, 2025

[37] Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022

[38] Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, and Jianyu Huang. Context parallelism for scalable million-token inference, 2025

[39] Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. In Advances in Neural Information Processing Systems, 2025

[40] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis, 2024

[41] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. 2025

[42] Jintao Zhang, Chendong Xiang, Haofeng Huang, Haocheng Xi, Jun Zhu, Jianfei Chen, et al. Spargeattention: Accurate and training-free sparse attention accelerating any model inference. In International Conference on Machine Learning, 2025

[43] Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, and Maneesh Agrawala. Frame context packing and drift prevention in next-frame-prediction video diffusion models, 2025

[44] Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast video generation with sliding tile attention, 2025

[45] Tianchen Zhao, Tongcheng Fang, Enshu Liu, Wan Rui, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation, 2024

[46] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs, 2024