
arxiv: 2606.03972 · v2 · pith:F6O6EAFPnew · submitted 2026-06-02 · 💻 cs.CV

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Pith reviewed 2026-06-28 10:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords asymmetric adversarial distillation · one-step autoregressive video generation · motion collapse · image-to-video synthesis · bidirectional discriminator · phased training strategy

The pith

An asymmetric discriminator attending bidirectionally over full video context prevents motion collapse in one-step autoregressive image-to-video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that breaking symmetry between a causal generator and a bidirectional discriminator, plus a phased training approach, solves motion collapse and instability in one-step autoregressive video models. A sympathetic reader would care because symmetric adversarial distillation produces static or drifting videos that limit fast generation from single images. The discriminator's full spatiotemporal view and single holistic score let it catch global temporal failures that symmetric setups miss. If correct, one-step models could reach the motion quality of slower multi-step methods while keeping autoregressive sampling intact.

Core claim

The central claim is that an asymmetric adversarial distillation framework enables stable one-step autoregressive image-to-video generation. It pairs a causal generator, which preserves autoregressive sampling, with a bidirectional discriminator that produces one holistic realism score over the entire sequence, and it adds an initial distribution-matching warm-up phase. Together these detect and penalize long-range drift and motion collapse, and the paper reports state-of-the-art results on VBench.

What carries the argument

The asymmetric discriminator that attends bidirectionally over full spatiotemporal context to output a single holistic realism score for the whole video.
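
To make the asymmetry concrete, here is a minimal sketch of the two attention patterns. It assumes attention is tracked per frame rather than per spatiotemporal token, and that the generator uses the block-wise causal attention named in the Figure 3 caption. The function names, the 6-frame clip, and the chunk size of 2 are illustrative assumptions, not details taken from the paper.

```python
def blockwise_causal_mask(num_frames, frames_per_chunk):
    """Generator-side mask (assumed form).

    mask[q][k] is True when query frame q may attend to key frame k.
    A frame sees every frame in its own chunk and in all earlier chunks.
    """
    def chunk(i):
        return i // frames_per_chunk

    return [
        [chunk(k) <= chunk(q) for k in range(num_frames)]
        for q in range(num_frames)
    ]


def bidirectional_mask(num_frames):
    """Discriminator-side mask: every frame may attend to every frame."""
    return [[True] * num_frames for _ in range(num_frames)]


# Hypothetical sizes chosen only for display.
gen_mask = blockwise_causal_mask(num_frames=6, frames_per_chunk=2)
disc_mask = bidirectional_mask(num_frames=6)

# Print the generator mask: 'x' = attention allowed, '.' = blocked.
for row in gen_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions the first chunk never sees later frames, while the discriminator mask has no blocked entries, which is what lets it judge the clip as a whole.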

If this is right

  • The one-step generator produces coherent motion without collapse while remaining autoregressive at inference.
  • The initial distribution-matching phase brings the student close enough to the teacher for subsequent adversarial training to succeed.
  • Global temporal failures become detectable because the discriminator sees the complete sequence rather than local patches.
  • One-step autoregressive video generation reaches performance levels previously limited to multi-step approaches.
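
The point about global temporal failures can be illustrated with a toy example. The sketch below is not the paper's discriminator: `frame_scores`, `mean_motion`, `holistic_score`, and the `min_motion` threshold are hypothetical stand-ins that use hand-written rules on tiny 4-pixel "frames". It only shows why a critic that scores frames one at a time cannot notice a static clip, while a single score computed over the whole clip can.

```python
def frame_scores(video):
    """Toy per-frame critic: sees one frame at a time.

    A frame counts as plausible (1.0) if its mean intensity is in [0.2, 0.8].
    """
    return [1.0 if 0.2 <= sum(f) / len(f) <= 0.8 else 0.0 for f in video]


def mean_motion(video):
    """Average absolute per-pixel change between consecutive frames."""
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(video, video[1:])
    ]
    return sum(diffs) / len(diffs)


def holistic_score(video, min_motion=0.05):
    """Toy video-level critic: one score for the whole clip.

    Returns the average appearance score, or 0.0 if the clip is nearly static.
    """
    appearance = sum(frame_scores(video)) / len(video)
    return appearance if mean_motion(video) >= min_motion else 0.0


# A collapsed clip: every frame is identical.
static_video = [[0.5, 0.5, 0.5, 0.5] for _ in range(4)]

# A clip where a dark pixel moves one position per frame.
moving_video = [
    [0.3, 0.5, 0.5, 0.5],
    [0.5, 0.3, 0.5, 0.5],
    [0.5, 0.5, 0.3, 0.5],
    [0.5, 0.5, 0.5, 0.3],
]

print(frame_scores(static_video), holistic_score(static_video))
print(frame_scores(moving_video), holistic_score(moving_video))
```

In this toy setting both clips pass every per-frame check, but only the moving clip keeps a nonzero clip-level score.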

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same asymmetry might apply to autoregressive generation in other domains such as audio sequences or 3D motion.
  • Reducing reliance on multi-step sampling could lower inference cost for video models if the phased strategy transfers.
  • Future work could test whether the holistic score generalizes to longer videos where drift accumulates over more frames.

Load-bearing premise

That a bidirectional discriminator can reliably detect and penalize motion collapse and long-range drift without introducing new instabilities or forcing changes to the causal generator's sampling process.

What would settle it

Videos from the one-step model show no gains in motion coherence or long-range consistency metrics over symmetric distillation baselines, or training diverges when the bidirectional discriminator is added.

Figures

Figures reproduced from arXiv: 2606.03972 by Haobo Li, Hao Ouyang, Jiapeng Zhu, Ka Leong Cheng, Qiuyu Wang, Yanhong Zeng, Yujun Shen, Yunhong Lu, Zhipeng Zhang.

Figure 1: We propose AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive video generation. Given a single conditioning image, AAD-1 generates videos autoregressively while maintaining both high visual quality and motion fidelity over long horizons, requiring only one sampling step per chunk.
Figure 2: Discriminator Architecture Comparison. We compare three configurations: (a) Causal backbone with frame-wise logits, providing dense local feedback but lacking global temporal context; (b) Causal backbone with video-level logit, aggregating information causally but still constrained by unidirectional attention; and (c) Bidirectional backbone with video-level logit (AAD-1), which attends to the full spatio…
Figure 3: Training Pipeline. We train a one-step autoregressive generator Gθ through three stages. (a) Stage I: ODE initialization replaces bidirectional attention in pre-trained video models with block-wise causal attention, trained by diffusion-forcing with flow-matching loss. (b) Stage II: One-step DMD Warmup distills a strong diffusion teacher under self-rollout training by matching real and fake scores, bringin…
Figure 4: Qualitative comparison. We compare our method against autoregressive baselines using 4-NFE sampling (CausVid (Yin et al., 2025) and Self Forcing (Huang et al., 2025)). Given a conditioning image of a swimming jellyfish, our method synthesizes vivid motion while maintaining visual fidelity and identity consistency over long horizons (up to 320 frames), whereas baselines exhibit identity drift …
Figure 5: User Preference Study. Win rates of our method against baselines (Self Forcing, CausVid). Our method is preferred in the majority among these methods.
Figure 6: Stage-wise ablation of DMD warmup (panels: initial frame, w/ DMD warmup, w/o DMD warmup). DMD warmup helps stabilize subsequent adversarial refinement and prevents severe visual degradation. Ablation on DMD warmup. We ablate the DMD warmup stage to verify whether adversarial refinement alone can reliably train a one-step autoregressive generator. As shown in …
Figure 7: Qualitative ablation study. We compare generated motion under four settings: (a) Causal backbone w/ frame-wise logits results in completely static videos; (b) Causal backbone w/ video-wise logit and (c) Bidirectional backbone w/ frame-wise logits are both prone to drift, exhibiting erratic camera movement, excessive speed, or color shifts. (d) Bidirectional backbone w/ video-wise logit (Ours) achieves the …
Figure 8: Drift in Causal Video Diffusion Model. Long-horizon rollout from the full-step causal teacher.
Figure 9: Effect of regularization coefficient λ. Without regularization (λ = 0), training collapses. Excessive regularization (λ = 50) introduces grid-like patterns. The optimal setting (λ = 20) balances stability and visual quality. Analysis of regularization coefficient. Beyond architectural choices, we find that the regularization coefficient λ plays a critical role in training stability. As illustrated in Fi…
Original abstract

We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.
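
The phased strategy can be pictured as a step-indexed schedule. The sketch below assumes the three stages named in the Figure 3 caption (ODE initialization, DMD warm-up, then adversarial training) run strictly one after another. The stage labels, the `stage_at` helper, and the boundary steps 100 and 300 are hypothetical. The source gives no step counts.

```python
from bisect import bisect_right

# Stage labels are illustrative names for the three stages in Figure 3.
STAGES = ["ode_init", "dmd_warmup", "adversarial"]


def stage_at(step, boundaries):
    """Return the stage active at a training step.

    boundaries: sorted step indices at which the next stage begins,
    one fewer than the number of stages.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if len(boundaries) != len(STAGES) - 1 or list(boundaries) != sorted(boundaries):
        raise ValueError("boundaries must be sorted, one per stage transition")
    return STAGES[bisect_right(boundaries, step)]


# Hypothetical boundaries: warm-up starts at step 100, adversarial at 300.
boundaries = [100, 300]
for step in (0, 99, 100, 299, 300):
    print(step, stage_at(step, boundaries))
```

Keeping the boundaries as data rather than hard-coded branches makes it easy to ablate the warm-up stage, for example by setting both boundaries to the same step.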

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces AAD-1, an Asymmetric Adversarial Distillation framework for one-step autoregressive image-to-video generation. It features a causal generator paired with a bidirectional discriminator that evaluates the full spatiotemporal context to produce a single holistic realism score. A phased training strategy is proposed, beginning with distribution matching to stabilize the one-step generator before transitioning to adversarial distillation. The authors claim this approach mitigates motion collapse and long-range drift, achieving state-of-the-art performance on the VBench benchmark.

Significance. Should the empirical claims be substantiated with rigorous comparisons and ablations, this work could contribute meaningfully to the development of efficient video generation models. The asymmetric architecture allows the discriminator to target global temporal issues without compromising the causal nature of the generator at inference time. The phased training strategy addresses a common challenge in adversarial training of generative models. Credit is given for the clear architectural insight and the practical training schedule.

minor comments (1)
  1. [Abstract] The abstract asserts state-of-the-art results on VBench but gives no specific metric values, baseline comparisons, or dataset details. Including key quantitative results would better support the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive summary, recognition of the asymmetric architecture and phased training contributions, and the recommendation for minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an asymmetric architecture (causal generator + bidirectional discriminator) and a phased training schedule (distribution matching warm-up followed by adversarial distillation) as explicit design choices. No equations, fitted parameters, or self-citations are presented that reduce the claimed performance gains or the detection of motion collapse to quantities defined by the method itself. Claims rest on external benchmark results (VBench) rather than internal self-consistency that would indicate circularity. The derivation chain is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the framework implicitly relies on standard assumptions of adversarial training stability and the existence of a suitable teacher model for distillation.

pith-pipeline@v0.9.1-grok · 5742 in / 1079 out tokens · 22884 ms · 2026-06-28T10:22:27.538365+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

    cs.CV 2026-06 unverdicted novelty 6.0

    Causal-rCM unifies teacher-forcing and self-forcing distillation for autoregressive video diffusion, delivering a 2-step model with VBench-T2V score 84.63 and enabling interactive world models on Cosmos 3 using only s...

Reference graph

Works this paper leans on

23 extracted references · 22 canonical work pages · cited by 1 Pith paper · 13 internal anchors

  1. [1]

    SkyReels-V2: Infinite-length Film Generative Model

    Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074,

  2. [2]

    Phased one-step adversarial equilibrium for video diffusion models

    Cheng, J., Ma, B., Ren, X., Jin, H. H., Yu, K., Zhang, P., Li, W., Zhou, Y., Zheng, T., and Lu, Q. Phased one-step adversarial equilibrium for video diffusion models. arXiv preprint arXiv:2508.21019,

  3. [3]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., and Hsieh, C.-J. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283,

  4. [4]

    The matrix: Infinite-horizon world generation with real-time moving control

    Feng, R., Zhang, H., Yang, Z., Xiao, J., Shu, Z., Liu, Z., Zheng, A., Huang, Y., Liu, Y., and Zhang, H. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568,

  5. [5]

    Relic: Interactive video world model with long-horizon memory

    Hong, Y., Mei, Y., Ge, C., Xu, Y., Zhou, Y., Bi, S., Hold-Geoffroy, Y., Roberts, M., Fisher, M., Shechtman, E., et al. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040,

  6. [6]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., and Shechtman, E. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009,

  7. [7]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Jacobs, S. A., Tanaka, M., Zhang, C., Zhang, M., Song, S. L., Rajbhandari, S., and He, Y. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509,

  8. [8]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603,

  9. [9]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131,

  10. [10]

    Diffusion adversarial post-training for one-step video generation

    Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., and Jiang, L. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316, 2025a. Lin, S., Yang, C., He, H., Jiang, J., Ren, Y., Xia, X., Zhao, Y., Xiao, X., and Jiang, L. Autoregressive adversarial post-training for real-time interactive video generation. arXiv prepri...

  11. [11]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003,

  12. [12]

    Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    Lu, Y., Ren, Y., Xia, X., Lin, S., Wang, X., Xiao, X., Ma, A. J., Xie, X., and Lai, J.-H. Adversarial distribution matching for diffusion distillation towards efficient image and video synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16818–16829, 2025a. ...

  13. [13]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512,

  14. [14]

    Magicdistillation: Weak-to-strong video distillation for large-scale few-step synthesis

    Shao, S., Yi, H., Guo, H., Ye, T., Zhou, D., Lingelbach, M., Xu, Z., and Xie, Z. Magicdistillation: Weak-to-strong video distillation for large-scale few-step synthesis. arXiv preprint arXiv:2503.13319,

  15. [15]

    MAGI-1: Autoregressive Video Generation at Scale

    Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211,

  16. [16]

    Flow map distillation without data

    Tong, S., Ma, N., Xie, S., and Jaakkola, T. Flow map distillation without data. arXiv preprint arXiv:2511.19428,

  17. [17]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

  18. [18]

    Efficient Streaming Language Models with Attention Sinks

    Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453,

  19. [19]

    Live2Diff: Live stream translation via uni-directional attention in video diffusion models

    Xing, Z., Fox, G., Zeng, Y., Pan, X., Elgharib, M., Theobalt, C., and Chen, K. Live2diff: Live stream translation via uni-directional attention in video diffusion models. arXiv preprint arXiv:2407.08701,

  20. [20]

    LongLive: Real-time Interactive Long Video Generation

    Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622,

  21. [21]

    Lumos-1: On autoregressive video generation from a unified model perspective

    Yuan, H., Chen, W., Cen, J., Yu, H., Liang, J., Chang, S., Lin, Z., Feng, T., Liu, P., Xing, J., et al. Lumos-1: On autoregressive video generation from a unified model perspective. arXiv preprint arXiv:2507.08801,

  22. [22]

    Packing input frame context in next-frame prediction models for video generation

    Zhang, L. and Agrawala, M. Frame context packing and drift prevention in next-frame-prediction video diffusion models. arXiv preprint arXiv:2504.12626,

  23. [23]

    with context parallel size 8 together with PyTorch activation checkpointing. Under the same Stage III setup, namely 64 H20 GPUs, 8 GPUs per node, and Ulysses-style context parallelism with cp = 8, the bidirectional discriminator adversarial training reaches a peak total GPU ...