arxiv: 2511.20714 · v2 · submitted 2025-11-25 · 💻 cs.CV · cs.AI

Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Pith reviewed 2026-05-17 05:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords world models · block diffusion · semi-autoregressive decoding · video generation · inference engine · world simulation · KV cache · LV-Bench

The pith

Inferix is a specialized inference engine that uses block-diffusion to generate long coherent videos for world simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Inferix as a next-generation inference engine built specifically for world models that produce long, physically realistic, and interactive videos. It relies on a semi-autoregressive block-diffusion decoding process that generates video in blocks, conditioning each block on the prior ones, and it reintroduces LLM-style KV cache management to support variable-length outputs. The paper claims that this setup yields more coherent sequences than standard video diffusion models and that added streaming and profiling features enable efficient real-time interaction. The system also integrates a new benchmark, LV-Bench, for evaluating minute-long video generation.

Core claim

Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation sets it apart from high-concurrency systems and classic video diffusion models by merging diffusion and autoregressive strengths, reintroducing KV cache management for efficient, variable-length, and high-quality generation.

What carries the argument

The semi-autoregressive (block-diffusion) decoding paradigm, which generates video tokens in blocks by applying diffusion within each block while conditioning on previous blocks and uses LLM-style KV cache management.
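
The decoding loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not Inferix's implementation: `denoise_block`, `generate`, and the plain list standing in for the KV cache are hypothetical names, and the "denoiser" is a placeholder that simply pulls noise toward the mean of the cached context. Only the control flow is meant to carry over: several denoising steps inside a block, conditioning on finished blocks, and a cache that is appended to once per block.

```python
import random

def denoise_block(noisy, context, steps):
    """Placeholder denoiser (assumption): move each value toward the context mean."""
    target = sum(context) / len(context) if context else 0.0
    block = list(noisy)
    for _ in range(steps):
        # All tokens in the block are refined together at every step.
        block = [v + 0.5 * (target - v) for v in block]
    return block

def generate(num_blocks, block_size, steps=4, seed=0):
    """Generate num_blocks blocks one after another, reusing cached context."""
    rng = random.Random(seed)
    kv_cache = []  # stands in for cached keys/values of finished blocks
    for _ in range(num_blocks):
        noisy = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        # Finished blocks are read from the cache, never recomputed.
        kv_cache.extend(denoise_block(noisy, kv_cache, steps))
    return kv_cache

video = generate(num_blocks=3, block_size=4)
longer = generate(num_blocks=5, block_size=4)
```

Because the output length is set only by `num_blocks`, asking for a longer sequence with the same seed extends the shorter one without changing its earlier blocks, which is the variable-length property the summary refers to.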

If this is right

  • World models gain the ability to produce longer, more stable video sequences for agentic AI, embodied AI, and gaming.
  • Real-time interaction with simulated environments becomes practical via interactive video streaming and profiling.
  • Minute-long video generation can be benchmarked consistently through seamless LV-Bench integration.
  • Scaling the models may unlock emergent capabilities in visual perception, understanding, and reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid block-based decoding could become a standard approach for extending video generation beyond fixed-length limits.
  • The engine's design suggests direct applicability to training agents in dynamic simulated worlds.
  • Profiling features may help identify bottlenecks in modeling physical dynamics over extended time horizons.

Load-bearing premise

The semi-autoregressive block-diffusion decoding paradigm overcomes the limitations of standard video diffusion through KV cache management, enabling efficient variable-length generation.
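
One way to see why a KV cache is compatible with this premise is a block-causal attention mask. The abstract does not give the paper's mask formulation, so the sketch below is an assumption based on the common block-wise pattern: tokens attend freely within their own block and to all earlier blocks, but never to later ones. The function name `block_causal_mask` is hypothetical.

```python
def block_causal_mask(num_tokens, block_size):
    """mask[i][j] is True when query token i may attend to key token j."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

mask = block_causal_mask(num_tokens=4, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under such a mask, no token in a finished block ever sees a later block, so its keys and values do not change as generation continues and can be stored once and reused.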

What would settle it

An experiment that generates minute-long videos with a standard diffusion model lacking block structure and KV cache and measures equivalent or superior coherence, stability, and speed would falsify the claimed advantage.
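
For the speed axis of such an experiment, a back-of-envelope count gives a rough expectation. The sketch below counts query-key attention pairs only, and it assumes both approaches use the same number of denoising steps and that cached keys and values cost nothing to reuse; these are simplifying assumptions, not figures from the paper. The function names are hypothetical, and the count says nothing about coherence or stability.

```python
def full_attention_pairs(num_blocks, block_size, steps):
    """All tokens attend to all tokens at every denoising step."""
    n = num_blocks * block_size
    return steps * n * n

def block_cached_pairs(num_blocks, block_size, steps):
    """Each block attends to the cached context plus itself at every step."""
    total = 0
    for k in range(num_blocks):
        keys = (k + 1) * block_size  # earlier blocks (cached) + current block
        total += steps * block_size * keys
    return total

full = full_attention_pairs(num_blocks=4, block_size=2, steps=3)
cached = block_cached_pairs(num_blocks=4, block_size=2, steps=3)
```

In this counting model the block-wise total approaches half of the full-attention total as the number of blocks grows, so any measured speedup well beyond that would have to come from other factors.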

Original abstract

World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Inferix, a block-diffusion (semi-autoregressive) inference engine tailored for world simulation and immersive video synthesis. It claims that this paradigm merges diffusion and autoregressive strengths to produce coherent long videos, overcomes standard video diffusion limitations by reintroducing LLM-style KV-cache management for efficient variable-length generation, and distinguishes itself from high-concurrency engines (vLLM, SGLang) and classic video diffusion systems (xDiTs). Additional features include interactive streaming, profiling, and integration with the new LV-Bench benchmark for minute-long video evaluation.

Significance. If the efficiency and quality claims hold under rigorous testing, Inferix could meaningfully advance practical deployment of world models for agentic and embodied AI by enabling longer, more stable video generation at interactive rates. The dedicated focus on simulation rather than generic inference is a clear positioning strength, though the current manuscript provides no empirical grounding to assess whether these advantages materialize.

major comments (2)
  1. [Abstract] The central claim that block-diffusion 'reintroduces LLM-style KV Cache management' to achieve efficient, variable-length generation is stated declaratively, with no attention-mask formulation, caching pseudocode, or latency/throughput measurements under a diffusion noise schedule to back it. This mechanism is load-bearing for the asserted superiority over standard video diffusion and high-concurrency engines.
  2. [Abstract] No derivations, ablation studies, or quantitative comparisons (e.g., against xDiTs or autoregressive baselines) are provided to support the coherence, stability, or efficiency advantages of the semi-autoregressive block-diffusion approach, so the design goals cannot be evaluated from the given text.
minor comments (1)
  1. The manuscript would benefit from explicit section headings and a methods or system-architecture subsection that details the block-diffusion implementation, even at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments point by point below and have revised the manuscript to incorporate the requested technical details and supporting analyses.

  1. Referee: [Abstract] The central claim that block-diffusion 'reintroduces LLM-style KV Cache management' to achieve efficient, variable-length generation is stated declaratively, with no attention-mask formulation, caching pseudocode, or latency/throughput measurements under a diffusion noise schedule to back it. This mechanism is load-bearing for the asserted superiority over standard video diffusion and high-concurrency engines.

    Authors: We agree that the abstract, as a concise summary, does not contain the full technical specifications. The attention-mask formulation and LLM-style KV-cache adaptation for block-diffusion under diffusion noise schedules are detailed in Section 3 of the manuscript, with pseudocode in Algorithm 1. We have revised the abstract to reference these sections explicitly. Latency and throughput measurements under varying noise schedules have been added to the experimental evaluation section, including direct comparisons demonstrating efficiency advantages.
    Revision: yes

  2. Referee: [Abstract] No derivations, ablation studies, or quantitative comparisons (e.g., against xDiTs or autoregressive baselines) are provided to support the coherence, stability, or efficiency advantages of the semi-autoregressive block-diffusion approach, so the design goals cannot be evaluated from the given text.

    Authors: We acknowledge that the provided manuscript text focuses on system overview and does not include these supporting elements. In the revision, we have added derivations of the semi-autoregressive block-diffusion paradigm in Appendix A. Ablation studies on block size, conditioning, and coherence metrics are now in Section 4, along with quantitative comparisons of stability and efficiency against xDiTs and autoregressive baselines, evaluated using LV-Bench for minute-long videos.
    Revision: yes

Circularity Check

0 steps flagged

No circularity; declarative positioning without equations or self-referential reductions

The abstract and system description present Inferix as a purpose-built engine for world simulation via semi-autoregressive block-diffusion, with claims about KV-cache reintroduction and efficiency stated directly as design advantages rather than derived from any internal equations, fitted parameters, or prior self-citations. No mathematical steps, uniqueness theorems, or ansatzes are shown that reduce the central efficiency assertion back to its own inputs by construction. The text is self-contained in its declarative framing and does not invoke load-bearing self-references or rename known results as novel derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the text is a high-level system description without mathematical derivations or postulated mechanisms.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

    cs.CV 2026-03 unverdicted novelty 6.0

    A dual-tower 4D embodied world model called RoboStereo reduces geometric hallucinations and delivers over 97% relative improvement on manipulation tasks via test-time augmentation, imitative learning, and open exploration.
