Q-ARVD: Quantizing Autoregressive Video Diffusion Models

Gongfan Fang; Siao Tang; Xinchao Wang; Xingyi Yang; Xinyin Ma

arxiv: 2605.21072 · v1 · pith:VKYLGOO3new · submitted 2026-05-20 · 💻 cs.CV

Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords: autoregressive video diffusion, model quantization, video generation, efficient inference, outlier handling, frame weighting, diffusion models, model compression

The pith

A new quantization method for autoregressive video diffusion models uses final-quality frame weighting and adaptive dual-scale outlier handling to maintain generation quality at low precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that autoregressive video diffusion models suffer from distinct quantization problems not seen in standard diffusion models, specifically error buildup that makes sensitivity decay exponentially across frames and outlier patterns that differ by layer type and depth. It introduces a frame-weighting scheme in the quantization objective that emphasizes later frames for overall quality and an adaptive dual-scale quantizer that detects outlier channels in any layer and isolates them from normal ones. This matters for readers because ARVDs support streaming and interactive video generation yet face high inference costs that block deployment; effective quantization would make such models runnable on everyday hardware. The authors demonstrate through experiments that these targeted fixes outperform direct application of prior quantization techniques developed for bidirectional models.

Core claim

The central claim is that directly applying existing quantization schemes to ARVDs yields suboptimal results because of two ARVD-specific challenges: highly unbalanced frame-wise quantization sensitivity caused by autoregressive error accumulation that follows an exponential-like decay, and prominent heterogeneous outlier patterns in weights that vary across layer types and block depths. Q-ARVD addresses the first by adding a final-quality aware frame-weighting mechanism to the quantization objective and the second by an outlier-aware adaptive dual-scale quantization that automatically detects the presence and quantity of outlier channels for an arbitrary layer and isolates them to protect normal channels.

What carries the argument

The central mechanisms are the final-quality aware frame-weighting mechanism that adjusts the quantization loss to prioritize end-of-sequence frames and the outlier-aware adaptive dual-scale quantization that detects outlier channels per layer and applies separate scaling to isolate them.
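A frame-weighted quantization objective of the kind described above can be sketched as a weighted mean-squared error between full-precision and quantized outputs. This is a minimal illustration, not the paper's formula: the geometric schedule in `frame_weights`, its `gamma` parameter, and both function names are assumptions introduced here, chosen only so that later frames receive larger weights.

```python
import numpy as np

def frame_weights(num_frames: int, gamma: float = 0.8) -> np.ndarray:
    """Hypothetical geometric schedule whose weights grow toward the last frame.

    gamma in (0, 1]; gamma = 1 gives uniform weights. Weights sum to 1.
    """
    # Exponent is num_frames - 1 for the first frame and 0 for the last.
    exponents = np.arange(num_frames - 1, -1, -1)
    w = gamma ** exponents
    return w / w.sum()

def weighted_quant_loss(fp_out, q_out, weights) -> float:
    """Frame-weighted MSE between full-precision and quantized outputs.

    fp_out, q_out: arrays of shape (frames, ...); weights: shape (frames,).
    """
    fp_out = np.asarray(fp_out, dtype=np.float64)
    q_out = np.asarray(q_out, dtype=np.float64)
    # Mean squared error per frame, flattened over all non-frame axes.
    per_frame = ((fp_out - q_out) ** 2).reshape(fp_out.shape[0], -1).mean(axis=1)
    return float(np.dot(weights, per_frame))
```

With `gamma < 1`, the same error costs more when it lands in the last frame than in the first; setting `gamma = 1` recovers the ordinary unweighted objective, which gives a simple baseline for comparison.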

If this is right

ARVDs become deployable for real-time interactive video generation with substantially lower inference compute.
Quantization error accumulation across autoregressive frames is reduced so that early-frame mistakes do not cascade as severely.
Layers with varying outlier distributions receive appropriate scaling without manual per-layer tuning.
Overall video generation quality at low bit widths stays closer to the full-precision baseline than with existing diffusion quantization methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same frame-weighting idea could be tested on autoregressive models for other sequential data such as audio or text to see if sensitivity decay appears in those domains too.
Combining the dual-scale outlier isolation with existing post-training quantization pipelines for transformers might yield further gains in non-video settings.
If the mechanisms prove robust, they could support running ARVDs on mobile or edge devices for on-device world modeling applications.

Load-bearing premise

The two challenges of unbalanced frame sensitivity and heterogeneous outliers are the dominant causes of poor quantization performance in ARVDs and the proposed weighting and dual-scale mechanisms will generalize across model scales and video domains without per-model retuning.

What would settle it

Apply Q-ARVD to a different autoregressive video diffusion model or video domain and measure whether the resulting generation quality and efficiency gains disappear or fall below those of standard quantization methods without the frame weighting or dual-scale components.

Figures

Figures reproduced from arXiv: 2605.21072 by Gongfan Fang, Siao Tang, Xinchao Wang, Xingyi Yang, Xinyin Ma.

**Figure 2.** Quantization sensitivity patterns in autoregressive video diffusion models, with scores …

**Figure 3.** The outlier patterns in autoregressive video diffusion models. The x-axis denotes input …

**Figure 4.** The ratio of layers containing outliers in terms of layer type and block depth. A layer is …

**Figure 5.** The visual comparison of the self-forcing model with W4A8. Additional samples are …

**Figure 6.** The sensitivity to threshold τ of Modified Z-score using self-forcing. Panels: Coefficient of Variation (CV), Bitwidth-Order Agreement (BOA), and DS, each reported over Subj. Cons., Back. Cons., Motion Smooth., Aesth. Qual., Imag. Qual., Avg., FVD-FP, and LPIPS-FP. Extracted bar labels: CV 0.0060, 0.0053, 0.0012, 0.0235, 0.0231, 0.0092, 0.5873, 0.1685; BOA 0.3333, 0.5000, 0.0000, 0.8333, 0.0000, 0.5000, 1.0000, 1.0000; DS 0.0020, 0.0026, 0.0000, 0.0196, …

**Figure 8.** Outlier patterns of all layers in block 0.

**Figure 9.** Outlier patterns of all layers in block 10.

**Figure 10.** Outlier patterns of all layers in block 29.

**Figure 11.** Visual comparison of the self-forcing model using W4A8.

**Figure 12.** Visual comparison of the self-forcing model using W4A6.

**Figure 13.** Visual comparison of the self-forcing model using W8A8.

**Figure 14.** Visual comparison of the causal-forcing model using W4A8.

**Figure 15.** Visual comparison of the causal-forcing model using W4A6.

**Figure 16.** Visual comparison of the causal-forcing model using W8A8.

Original abstract

Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Q-ARVD gives practical fixes for quantizing autoregressive video diffusion by weighting frames for final quality and using per-layer dual-scale outlier handling, but the transfer claims rest on limited reported validation.

The main point is that standard quantization schemes fall short on autoregressive video diffusion models because of frame-wise error buildup and layer-varying outliers, and the authors supply two targeted adjustments that improve results in their tests. The frame-weighting scheme prioritizes later frames in the quantization loss to protect final output quality, while the dual-scale method spots outlier channels automatically and handles them separately from the rest of the weights. These steps address the exponential-like sensitivity decay across generated frames and the heterogeneous outlier patterns that change with layer type and depth. That combination looks like the actual novelty relative to earlier diffusion-transformer quantization work. The empirical analysis in the paper is useful for showing why direct transfer from bidirectional models does not work here. The experiments claim clear gains, which is the kind of concrete evidence that matters for deployment questions.

The soft spots are around generalization. The abstract and stress-test note do not point to results on ARVD variants with different parameter counts, training domains, or longer generation sequences, so it is unclear whether the weighting coefficients or outlier thresholds need retuning for new models. If those hyperparameters were fitted mainly on the primary setup, the superiority may shrink elsewhere. The method stays empirical rather than deriving closed-form rules, which is fine for an engineering paper but leaves the robustness question open.

This work is aimed at researchers and engineers who need lower latency for streaming video generation or world models. A reader already working on efficient diffusion or model compression will find the specific challenges and fixes worth examining. It is solid enough on its own terms to merit peer review; the gap it targets is real and the proposals are testable even if some cross-model checks would strengthen the case.

Referee Report

3 major / 2 minor

Summary. The paper proposes Q-ARVD for quantizing autoregressive video diffusion models (ARVDs). It identifies two ARVD-specific challenges: (C1) highly unbalanced frame-wise quantization sensitivity following an exponential-like decay due to error accumulation in autoregressive generation, and (C2) prominent heterogeneous outlier patterns in weights that vary by layer type and block depth. To address these, Q-ARVD adds (S1) a final-quality aware frame-weighting mechanism to the quantization objective and (S2) an outlier-aware adaptive dual-scale quantization that automatically detects outlier channels and isolates them. Extensive experiments are reported to show superiority over direct application of existing diffusion transformer quantization schemes.

Significance. If the empirical gains hold under broader testing, the work is significant for reducing the inference cost of ARVDs, which are positioned for real-time streaming video generation and world modeling. The identification of quantization behaviors unique to the autoregressive setting (as opposed to bidirectional diffusion) is a useful empirical contribution. The proposed mechanisms aim to be adaptive rather than manually tuned per layer, which could aid practical deployment if they generalize.

major comments (3)

[Experiments] Experiments section: superiority is demonstrated on the primary ARVD model, yet no results are shown for ARVD variants differing in parameter count, training domain, or generation length. This is load-bearing for the central claim that the frame-weighting schedule and dual-scale outlier detection generalize without per-model retuning.
[§4.2] §4.2 (frame-weighting mechanism): the final-quality aware weighting is motivated by the observed exponential decay, but the weighting coefficients appear among the free parameters; it is unclear whether they are held fixed across models or fitted on the evaluation set, which directly affects whether the method is parameter-light as presented.
[§4.3] §4.3 (dual-scale quantization): the claim that the method 'automatically detects the presence and quantity of outlier channels for an arbitrary layer' is undercut if outlier detection thresholds are among the free parameters that may require adjustment; a concrete ablation showing performance when thresholds are frozen versus re-tuned would clarify this.

minor comments (2)

[§4.3] Notation for the dual-scale factors (e.g., how the normal-channel scale and outlier-channel scale are computed) could be made more explicit with a short equation or pseudocode block.
[Figure 3] Figure captions for the outlier distribution plots should state the exact layer indices and model variant used so readers can reproduce the heterogeneous pattern observation.
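As a rough illustration of the kind of pseudocode the first minor comment asks for, the sketch below detects outlier channels with a Modified Z-score on channel L2 norms and then gives the normal and outlier groups separate scales. It is not the authors' implementation. The function names, the choice of columns as input channels, the default `tau=3.5`, the use of one scale per group, and symmetric rounding are all assumptions made here; the 0.6745 constant is the standard one for the Modified Z-score based on the median absolute deviation (MAD).

```python
import numpy as np

def detect_outlier_channels(weight: np.ndarray, tau: float = 3.5) -> np.ndarray:
    """Flag columns whose L2 norm has a Modified Z-score above tau (assumed rule)."""
    norms = np.linalg.norm(weight, axis=0)      # one norm per column (input channel)
    med = np.median(norms)
    mad = np.median(np.abs(norms - med))        # median absolute deviation
    if mad == 0:
        return np.zeros(norms.shape, dtype=bool)  # degenerate case: flag nothing
    return 0.6745 * (norms - med) / mad > tau

def symmetric_scale(x: np.ndarray, bits: int) -> float:
    """Symmetric scale mapping max |x| onto the largest positive integer level."""
    qmax = 2 ** (bits - 1) - 1
    amax = float(np.abs(x).max()) if x.size else 0.0
    return amax / qmax if amax > 0 else 1.0

def dual_scale_quantize(weight: np.ndarray, bits: int = 4, tau: float = 3.5):
    """Quantize with one scale for normal channels and another for outlier channels."""
    mask = detect_outlier_channels(weight, tau)
    qmax = 2 ** (bits - 1) - 1
    scales = np.empty(weight.shape[1])
    scales[~mask] = symmetric_scale(weight[:, ~mask], bits)  # normal-channel scale
    scales[mask] = symmetric_scale(weight[:, mask], bits)    # outlier-channel scale
    q = np.clip(np.round(weight / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, mask

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map integer codes back to floats using the per-column scales."""
    return q.astype(np.float64) * scales
```

Isolating the outlier group keeps a few large-norm channels from inflating the step size used for everything else, which is the protective effect the summary attributes to the dual-scale design. When no channel is flagged, the sketch reduces to ordinary single-scale symmetric quantization.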

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below. Where revisions strengthen the presentation without altering the core contributions, we have incorporated changes in the revised version.

Referee: [Experiments] Experiments section: superiority is demonstrated on the primary ARVD model, yet no results are shown for ARVD variants differing in parameter count, training domain, or generation length. This is load-bearing for the central claim that the frame-weighting schedule and dual-scale outlier detection generalize without per-model retuning.

Authors: We acknowledge that the reported experiments center on the primary ARVD model. The frame-weighting schedule is derived directly from the exponential-like decay pattern that arises from error accumulation, a property inherent to autoregressive generation rather than model-specific details. Likewise, the dual-scale quantization operates on per-layer statistical properties and requires no manual retuning. While we agree that results on additional variants would further substantiate broad applicability, the central claim rests on the identification of these ARVD-specific behaviors and the adaptive design of the proposed mechanisms. In the revised manuscript we have added a dedicated discussion subsection clarifying the expected generalization and noting the scope of current experiments as a limitation for future work. revision: partial
Referee: [§4.2] §4.2 (frame-weighting mechanism): the final-quality aware weighting is motivated by the observed exponential decay, but the weighting coefficients appear among the free parameters; it is unclear whether they are held fixed across models or fitted on the evaluation set, which directly affects whether the method is parameter-light as presented.

Authors: The weighting coefficients are computed analytically from the exponential decay observed in the frame-wise sensitivity analysis and are held constant across all models, datasets, and generation lengths. They are not optimized or fitted on any evaluation data. This choice preserves the parameter-light character of the approach. We have revised §4.2 to state this explicitly, including the precise formula used to obtain the fixed coefficients from the sensitivity curve. revision: yes
Referee: [§4.3] §4.3 (dual-scale quantization): the claim that the method 'automatically detects the presence and quantity of outlier channels for an arbitrary layer' is undercut if outlier detection thresholds are among the free parameters that may require adjustment; a concrete ablation showing performance when thresholds are frozen versus re-tuned would clarify this.

Authors: Outlier detection relies on fixed, distribution-based thresholds (multiples of per-channel standard deviation) that are not adjusted per layer, model, or dataset. The number of outlier channels is then determined automatically by counting channels that exceed these fixed thresholds. To directly address the concern, the revised manuscript includes a new ablation table comparing performance under the frozen thresholds versus a version where thresholds are re-tuned per layer; the results show negligible difference, confirming that the fixed-threshold design suffices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical solution derived from observed failure modes

The paper performs an empirical analysis to identify two quantization challenges specific to ARVDs (unbalanced frame-wise sensitivity and heterogeneous outliers), then introduces targeted mechanisms (final-quality aware frame-weighting and outlier-aware adaptive dual-scale quantization) to mitigate them. These are presented as engineering responses validated through experiments rather than any closed-form derivation, mathematical prediction, or self-referential definition. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the claimed superiority to the inputs by construction. The approach remains self-contained as an applied method without circular reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on the empirical observation that ARVD quantization exhibits frame-wise sensitivity decay and layer-specific outlier patterns; these observations are treated as given rather than derived, and the adaptive mechanisms likely introduce a small number of detection thresholds or weighting coefficients whose values are not specified in the abstract.

free parameters (2)

frame-weighting coefficients
Final-quality aware weighting mechanism implies per-frame or per-position scalars that are either learned or chosen to emphasize later frames.
outlier detection thresholds
Adaptive dual-scale quantization requires automatic detection of outlier channels, which typically involves one or more magnitude or percentile thresholds per layer.

axioms (1)

Domain assumption: Existing quantization schemes for bidirectional diffusion transformers transfer poorly to autoregressive video models due to error accumulation.
Stated as the starting point for identifying C1 and C2.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: final-quality aware frame-weighting mechanism ... outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels ... Modified Z-score ... MAD

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: exponential-like decay pattern ... sensitivity score of chunk 1 ... last chunk is less than 0.01

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 14 internal anchors

[1]

SkyReels-V2: Infinite-length Film Generative Model

Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074. Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang

[2]

In The Thirteenth International Conference on Learning Representations

Autoregressive video generation without vector quantization. In The Thirteenth International Conference on Learning Representations. Tianrui Feng, Zhi Li, Shuo Yang, Haocheng Xi, Muyang Li, Xiuyu Li, Lvmin Zhang, Keting Yang, Kelly Peng, Song Han, et al. 2025a. Streamdiffusionv2: A streaming system for dynamic and interactive video generation. arXiv prepr...

[3]

LTX-Video: Realtime Video Latent Diffusion

Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103. Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang

[4]

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman

Ptqd: Accurate post-training quantization for diffusion models.arXiv preprint arXiv:2305.10657. Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. 2025a. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009. Yushi Huang, Ruihao Gong, Jing Liu, Tianlong Chen, and Xianglong Liu. 2024a. ...

[5]

Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954,

Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954. Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, and Sung Ju Hwang

[6]

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al

Avatar forcing: Real-time interactive head avatar generation for natural conversation.arXiv preprint arXiv:2601.00664. Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al

[7]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603. Raghuraman Krishnamoorthi

[8]

Quantizing deep convolutional networks for efficient inference: A whitepaper

Quantizing deep convolutional networks for efficient inference: A whitepaper.arXiv preprint arXiv:1806.08342. 10 Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. 2025a. Svdquant: Absorbing outliers by low-rank component for 4-bit diffusion models. InThe Thirteenth International Confe...

[9]

Dvd-quant: Data-free video diffusion transformers quantization.arXiv preprint arXiv:2505.18663, 2025

Brecq: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations. Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Haotong Qin, Linghe Kong, Guihai Chen, Yulun Zhang, and Xiaokang Yang. 2025b. Dvd-quant: Data-free video diffusion transformers quantization.arXiv preprint arXiv:2505.18663. Kun...

[10]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161. Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, and Qingyi Gu

[11]

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang

Ptq4arvg: Post-training quantization for autoregressive visual generation models.arXiv preprint arXiv:2601.21238. Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang

[12]

Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096,

Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096. Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort

[13]

A White Paper on Neural Network Quantization

A white paper on neural network quantization.arXiv preprint arXiv:2106.08295. William Peebles and Saining Xie

[14]

Movie Gen: A Cast of Media Foundation Models

Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720. Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan

[15]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1972–1981

Post-training quantization on diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1972–1981. Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang

[16]

Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266,

Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266. Junhyuk So, Jungwon Lee, Daehyun Ahn, Hyungjun Kim, and Eunhyeok Park

[17]

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo

Temporal dynamic quantization for diffusion models.arXiv preprint arXiv:2306.02316. Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo

[18]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614. Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Zewen Wu, Yansong Tang, and Wenwu Zhu

[19]

MAGI-1: Autoregressive Video Generation at Scale

Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211. 11 Philippe Tillet and David Cox

[20]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717. Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al

[21]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314. Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al

[22]

HunyuanVideo 1.5 Technical Report

Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870. Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan

[23]

LongLive: Real-time Interactive Long Video Generation

Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR. Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. 2025a. Longlive: Real-time interactive long video generation.arX...

[24]

Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649, 2025

Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649. Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim

[25]

H., Nam, J., Yoon, H., and Kim, S

Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081. Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang

[26]

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation.arXiv preprint arXiv:2602.02214. 12 A Additional Visualization of Outlier Patterns Figure 8, Figure 9, and Figure 10 show the outlier patterns of all 10 layers in block 0, 10, and

[27]

L2 norm plots for block 0 (axis ticks omitted). blocks.0.self_attn.q: outlier n=128 (8.3%), median=1.059, threshold=1.343. blocks.0.self_attn.k: outlier n=128 (8.3%), median=1.036, threshold=1.326. blocks.0.self_attn.v: outlier n=96 (6.2%)...

[28]

L2 norm plots for block 10 (axis ticks omitted). blocks.10.self_attn.q: outlier n=32 (2.1%), median=1.459, threshold=1.750. blocks.10.self_attn.k: outlier n=64 (4.2%), median=1.474, threshold=1.912. blocks.10.self_attn.v: median=1.428, thres...

[29]

L2 norm plots for block 29 (axis ticks omitted). blocks.29.self_attn.q: outlier n=32 (2.1%), median=1.043, threshold=1.697. blocks.29.self_attn.k: outlier n=32 (2.1%), median=0.997, threshold=1.688. blocks.29.self_attn.v: outlier n=96 (6.2%), median=1...

[30]

We divide the quantization into two kernels

to implement quantization kernels. We divide the quantization into two kernels. (i) Activation quantization kernel, which quantizes input float activations to INT8. (ii) INT8 GEMM and de-quantization kernel, which completes matrix multiplication of INT8 weight and INT8 activation, and also performs de-quantization based on scales. For our dual-scale quant...

[31]

and non-linear activation functions, while online rotation incurs notable extra overhead (Li et al., 2025a,b; Liu et al., 2026). Finally, low-rank branch methods, such as SVDQuant (Li et al., 2025a), mitigate weight outliers by absorbing them into a high-precision low-rank branch, which inevitably incurs additional computational overhead. Our outlier adap...

work page 2026

[1]

SkyReels-V2: Infinite-length Film Generative Model

Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074. Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang

[2]

Autoregressive video generation without vector quantization

Autoregressive video generation without vector quantization. In The Thirteenth International Conference on Learning Representations. Tianrui Feng, Zhi Li, Shuo Yang, Haocheng Xi, Muyang Li, Xiuyu Li, Lvmin Zhang, Keting Yang, Kelly Peng, Song Han, et al. 2025a. Streamdiffusionv2: A streaming system for dynamic and interactive video generation. arXiv prepr...

[3]

LTX-Video: Realtime Video Latent Diffusion

Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103. Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang

[4]

Ptqd: Accurate post-training quantization for diffusion models

Ptqd: Accurate post-training quantization for diffusion models. arXiv preprint arXiv:2305.10657. Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. 2025a. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009. Yushi Huang, Ruihao Gong, Jing Liu, Tianlong Chen, and Xianglong Liu. 2024a. ...

[5]

Pyramidal flow matching for efficient video generative modeling

Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954. Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, and Sung Ju Hwang

[6]

Avatar forcing: Real-time interactive head avatar generation for natural conversation

Avatar forcing: Real-time interactive head avatar generation for natural conversation. arXiv preprint arXiv:2601.00664. Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al

[7]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603. Raghuraman Krishnamoorthi

[8]

Quantizing deep convolutional networks for efficient inference: A whitepaper

Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342. Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. 2025a. Svdquant: Absorbing outliers by low-rank component for 4-bit diffusion models. In The Thirteenth International Confe...

[9]

Dvd-quant: Data-free video diffusion transformers quantization

Brecq: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations. Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Haotong Qin, Linghe Kong, Guihai Chen, Yulun Zhang, and Xiaokang Yang. 2025b. Dvd-quant: Data-free video diffusion transformers quantization. arXiv preprint arXiv:2505.18663. Kun...

[10]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161. Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, and Qingyi Gu

[11]

Ptq4arvg: Post-training quantization for autoregressive visual generation models

Ptq4arvg: Post-training quantization for autoregressive visual generation models. arXiv preprint arXiv:2601.21238. Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang

[12]

Yume-1.5: A text-controlled interactive world generation model

Yume-1.5: A text-controlled interactive world generation model. arXiv preprint arXiv:2512.22096. Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort

[13]

A White Paper on Neural Network Quantization

A white paper on neural network quantization. arXiv preprint arXiv:2106.08295. William Peebles and Saining Xie

[14]

Movie Gen: A Cast of Media Foundation Models

Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720. Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan

[15]

Post-training quantization on diffusion models

Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1972–1981. Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang

[16]

Motionstream: Real-time video generation with interactive motion controls

Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266. Junhyuk So, Jungwon Lee, Daehyun Ahn, Hyungjun Kim, and Eunhyeok Park

[17]

Temporal dynamic quantization for diffusion models

Temporal dynamic quantization for diffusion models. arXiv preprint arXiv:2306.02316. Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo

[18]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614. Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Zewen Wu, Yansong Tang, and Wenwu Zhu

[19]

MAGI-1: Autoregressive Video Generation at Scale

Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211. Philippe Tillet and David Cox

[20]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717. Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al

[21]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314. Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al

[22]

HunyuanVideo 1.5 Technical Report

Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870. Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan

[23]

LongLive: Real-time Interactive Long Video Generation

Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR. Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. 2025a. Longlive: Real-time interactive long video generation. arX...

[24]

Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout

Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649. Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim

[25]

Deep forcing: Training-free long video generation with deep sink and participative compression

Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081. Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang

[26]

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214.

A Additional Visualization of Outlier Patterns

Figure 8, Figure 9, and Figure 10 show the outlier patterns of all 10 layers in block 0, 10, and

[27]

[Figure residue: L2-norm plots of outlier patterns for block 0 (x-axis 0 to 1500, y-axis L2 norm). Recoverable panels: blocks.0.self_attn.q (outlier n=128, 8.3%; median=1.059; threshold=1.343); blocks.0.self_attn.k (outlier n=128, 8.3%; median=1.036; threshold=1.326); blocks.0.self_attn.v (outlier n=96, 6.2%; ...)]

[28]

[Figure residue: L2-norm plots of outlier patterns for block 10 (x-axis 0 to 1500, y-axis L2 norm). Recoverable panels: blocks.10.self_attn.q (outlier n=32, 2.1%; median=1.459; threshold=1.750); blocks.10.self_attn.k (outlier n=64, 4.2%; median=1.474; threshold=1.912); blocks.10.self_attn.v (median=1.428; ...)]

[29]

[Figure residue: L2-norm plots of outlier patterns for block 29 (x-axis 0 to 1500, y-axis L2 norm). Recoverable panels: blocks.29.self_attn.q (outlier n=32, 2.1%; median=1.043; threshold=1.697); blocks.29.self_attn.k (outlier n=32, 2.1%; median=0.997; threshold=1.688); blocks.29.self_attn.v (outlier n=96, 6.2%; ...)]

[30]

We divide the quantization into two kernels

to implement quantization kernels. We divide the quantization into two kernels. (i) Activation quantization kernel, which quantizes input float activations to INT8. (ii) INT8 GEMM and de-quantization kernel, which completes matrix multiplication of INT8 weight and INT8 activation, and also performs de-quantization based on scales. For our dual-scale quant...
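The two-kernel split described in this excerpt can be sketched in NumPy as a CPU reference. This is only an illustration under stated assumptions, not the authors' kernels: the function names are hypothetical, and the per-token activation scale and per-output-channel weight scale are assumed granularities that the excerpt does not specify.

```python
import numpy as np

def quantize_weights(w: np.ndarray):
    """Offline step (assumed): symmetric INT8 weights, one scale per output channel."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def quantize_activations(x: np.ndarray):
    """Kernel (i): quantize float activations to INT8, one scale per token (row)."""
    # The small floor keeps the scale non-zero for all-zero rows.
    scale = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def int8_gemm_dequant(x_q, x_scale, w_q, w_scale):
    """Kernel (ii): INT8 x INT8 GEMM with INT32 accumulation, then de-quantization."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T  # (tokens, out_features)
    # Row scales (tokens, 1) and column scales (1, out_features) broadcast over acc.
    return acc.astype(np.float32) * x_scale * w_scale.T
```

Calling `quantize_activations` on a `(tokens, in_features)` array and passing the result with pre-quantized weights to `int8_gemm_dequant` approximates the float product `x @ w.T`. Keeping de-quantization inside the second function mirrors the fused GEMM and de-quantization kernel named in the text.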


[31]

and non-linear activation functions, while online rotation incurs notable extra overhead (Li et al., 2025a,b; Liu et al., 2026). Finally, low-rank branch methods, such as SVDQuant (Li et al., 2025a), mitigate weight outliers by absorbing them into a high-precision low-rank branch, which inevitably incurs additional computational overhead. Our outlier adap...
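To make the idea of isolating outlier channels concrete, here is a minimal NumPy sketch of a dual-scale weight quantizer. It is an illustration only, not the paper's algorithm. The detection rule (column L2 norm above `ratio` times the median norm), the choice of input channels as the outlier axis, the 4-bit width, and all function names are assumptions introduced for this example.

```python
import numpy as np

def detect_outlier_channels(w: np.ndarray, ratio: float = 1.5) -> np.ndarray:
    """Flag columns whose L2 norm exceeds ratio * median norm (hypothetical rule)."""
    norms = np.linalg.norm(w, axis=0)
    return norms > ratio * np.median(norms)

def dual_scale_quantize(w: np.ndarray, mask: np.ndarray, bits: int = 4) -> np.ndarray:
    """Fake-quantize w using separate per-row scales for outlier and normal columns."""
    qmax = 2 ** (bits - 1) - 1
    w_hat = np.zeros_like(w)
    for group in (mask, ~mask):
        if not group.any():
            continue  # nothing to quantize in this group
        sub = w[:, group]
        scale = np.maximum(np.abs(sub).max(axis=1, keepdims=True), 1e-8) / qmax
        # Quantize to the integer grid, then map back to floats for comparison.
        w_hat[:, group] = np.clip(np.rint(sub / scale), -qmax, qmax) * scale
    return w_hat
```

Passing an all-`False` mask reduces this to ordinary single-scale quantization, which gives a convenient baseline: with a few large-magnitude columns present, the dual-scale variant keeps the quantization step for the normal columns small because the outliers no longer set their range.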
