Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

Matthew Bendel; Mithilesh Vaidya; Stephen W. Bailey; Sumukh Badam; Xingzhe He

arxiv: 2605.20476 · v1 · pith:SIMU66AGnew · submitted 2026-05-19 · 💻 cs.CV

Pith reviewed 2026-05-21 07:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords: long-horizon video generation · drift reduction · tree sampling · video-to-video · anchored imputation · static camera · inference-time scheduler

The pith

Anchored Tree Sampling replaces sequential video rollout with a hierarchy of sparse anchors to bound drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Anchored Tree Sampling as a training-free method for long-horizon video-to-video generation. It first produces sparse anchor frames across the full sequence using the base model, then refines intermediate anchors recursively and fills the intervals between them. This structure limits how far errors can propagate compared with frame-by-frame autoregressive generation. A sympathetic reader would care because the approach enables longer coherent outputs in static-camera settings without retraining the underlying model.

Core claim

Anchored Tree Sampling is a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from K sequential rollout steps to L+1 tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. The method focuses on the static-camera regime where sparse anchors over the horizon are well approximated by the dense conditioning signal so that the base model can produce them without retraining.

What carries the argument

Anchored Tree Sampling, a scheduler that organizes generation as root anchors, recursive refinement, and leaf spans to confine drift between nearby references.
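The root, refinement, and leaf structure described above can be sketched as a pure scheduling routine. This is a minimal illustration, not the authors' implementation: the names `Call`, `spaced`, and `plan_ats`, the even spacing of anchors, and the parameters `anchors_per_call` and `leaf_span` are all assumptions made for the example, and no video model is invoked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    level: int      # 0 is the root; deeper levels depend on shallower ones
    left: int       # left boundary frame of the interval this call covers
    right: int      # right boundary frame of the interval this call covers
    frames: tuple   # frame indices this call is asked to generate


def spaced(left, right, count):
    """Integer positions from left to right inclusive, spread evenly."""
    return sorted({left + (i * (right - left)) // (count - 1) for i in range(count)})


def plan_ats(num_frames, anchors_per_call=5, leaf_span=8):
    """List every model call of a sparse-to-dense, anchor-bounded schedule.

    Hypothetical layout: the root places anchors over the whole horizon,
    wide intervals receive intermediate anchors, and intervals no wider
    than leaf_span are filled densely by a leaf call.
    """
    if num_frames < 2 or anchors_per_call < 3 or leaf_span < 1:
        raise ValueError("need num_frames >= 2, anchors_per_call >= 3, leaf_span >= 1")
    root = spaced(0, num_frames - 1, anchors_per_call)
    calls = [Call(0, 0, num_frames - 1, tuple(root))]

    def refine(left, right, level):
        gap = right - left
        if gap <= 1:
            return  # neighboring anchors are adjacent frames; nothing to fill
        if gap <= leaf_span:
            # Leaf span: synthesize every frame strictly between the two anchors.
            calls.append(Call(level, left, right, tuple(range(left + 1, right))))
            return
        # Refinement: add intermediate anchors, then recurse into each sub-interval.
        inner = spaced(left, right, anchors_per_call)
        calls.append(Call(level, left, right, tuple(inner[1:-1])))
        for a, b in zip(inner, inner[1:]):
            refine(a, b, level + 1)

    for a, b in zip(root, root[1:]):
        refine(a, b, 1)
    return calls
```

Under these assumed settings, `plan_ats(257)` places root anchors at frames 0, 64, 128, 192, and 256, and its deepest call sits at level 3, so any frame depends on at most four stacked calls. Calls at the same level do not depend on each other in this sketch, which is what makes the number of levels, rather than the number of calls, the relevant measure of sequential depth.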

If this is right

ATS outperforms two contemporary autoregressive baselines on Wan 2.1 plus VACE across five conditioning modalities in overall quality and drift prevention.
The method supports stable generation of at least 40 minutes on LTX-2.3 across the same modalities.
The critical path length drops from K sequential steps to L plus one tree-hierarchical steps.
The paper proposes extending the approach to arbitrarily long text-to-video generation and to dynamic-camera and multi-shot regimes.
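The critical-path claim in the list above can be made concrete with a small calculation. The sketch below assumes a balanced tree in which every call splits its interval into a fixed number of sub-intervals (`branching`), so that L is the smallest depth whose leaves cover K spans. The paper summary does not state how L is chosen, so this relationship and the function names are illustrative assumptions.

```python
def tree_levels(num_spans, branching):
    """Smallest L with branching**L >= num_spans, using integer arithmetic."""
    if num_spans < 1 or branching < 2:
        raise ValueError("need num_spans >= 1 and branching >= 2")
    levels, capacity = 0, 1
    while capacity < num_spans:
        capacity *= branching
        levels += 1
    return levels


def critical_path(num_spans, branching):
    """Sequential steps for left-to-right rollout versus a balanced tree."""
    return {
        "sequential": num_spans,                        # K dependent rollout steps
        "tree": tree_levels(num_spans, branching) + 1,  # L + 1 dependent stages
    }
```

For example, `critical_path(64, 4)` gives 64 dependent steps for sequential rollout and 4 dependent stages for the tree, since three levels of four-way splitting already cover 64 leaf spans.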

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The tree structure could reduce compounding errors in other autoregressive domains such as audio or long text sequences.
Pairing ATS with existing distillation techniques might further improve anchor quality without changing the inference schedule.
Explicit modeling of camera motion would be needed before the same anchoring logic applies reliably outside the static-camera case.

Load-bearing premise

Sparse anchors over the full horizon are well approximated by the dense conditioning signal so the base model can produce them without retraining.

What would settle it

Generate a long sequence where the initial sparse anchor frames visibly mismatch the conditioning signals at distant time points and measure whether continuity collapses between those anchors.
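One way to quantify whether continuity collapses around anchors is to compare frame-to-frame change at anchor positions with change elsewhere. The sketch below is a hypothetical diagnostic, not a metric from the paper: frames are assumed to be flat lists of pixel values, and `frame_delta` and `anchor_jump_ratio` are names invented for this example.

```python
def frame_delta(a, b):
    """Mean absolute difference between two flat, equally sized frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and equally sized")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def anchor_jump_ratio(frames, anchor_indices):
    """Mean change across transitions touching an anchor, divided by the
    mean change across all other transitions. Values well above 1 suggest
    visible seams at anchors."""
    anchors = set(anchor_indices)
    at_anchor, elsewhere = [], []
    for i in range(len(frames) - 1):
        delta = frame_delta(frames[i], frames[i + 1])
        if i in anchors or i + 1 in anchors:
            at_anchor.append(delta)
        else:
            elsewhere.append(delta)
    if not at_anchor or not elsewhere:
        raise ValueError("need transitions both at anchors and away from them")
    baseline = sum(elsewhere) / len(elsewhere)
    if baseline == 0:
        return float("inf") if sum(at_anchor) > 0 else 1.0
    return (sum(at_anchor) / len(at_anchor)) / baseline
```

On a smoothly changing toy sequence the ratio stays at 1, while an anchor that disagrees with its neighbors pushes it above 1. A real test would use decoded video frames and likely a perceptual distance rather than raw pixel differences.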

Original abstract

Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce Anchored Tree Sampling (ATS): a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from K sequential rollout steps to L+1 tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the static-camera regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable ≥ 40-minute generation on LTX-2.3 across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ATS is a training-free tree scheduler that bounds drift in long V2V generation by producing sparse anchors first and then refining them hierarchically.

The main point is that Anchored Tree Sampling replaces straight left-to-right rollout with a tree of sparse-to-dense imputations. A root step sets anchors across the full horizon, then recursive levels fill intermediates, and leaves synthesize the final spans. This shortens the longest dependency chain and turns compounding drift into errors bounded by anchor intervals.

They run it on Wan 2.1 plus VACE and report better overall quality plus lower drift than two autoregressive baselines across inpainting, outpainting, edge, pose, and depth. They also show 40-minute clips on LTX-2.3 without retraining the base model. The static-camera focus is key because dense conditioning signals are assumed to let the unmodified model produce usable sparse anchors directly. That assumption lets the method stay training-free and practical. The tree structure itself is the clearest novelty here; it does not collapse to the distillation approaches cited in the abstract.

One soft spot is the lack of separate checks on root-anchor quality versus horizon length. If the base model already drifts on those distant sparse points despite the conditioning, the tree simply propagates the problem rather than fixing it. The reported video-level metrics do not isolate this, so the bounding claim rests partly on the overall scores.

This work is aimed at video synthesis researchers and engineers who need longer outputs from existing models in controlled settings. Anyone testing inference schedulers for stability will find the concrete recipe useful to replicate. It deserves a serious referee because the idea is distinct, the experiments cover multiple modalities, and the results are presented clearly enough to evaluate.

Referee Report

2 major / 2 minor

Summary. The paper introduces Anchored Tree Sampling (ATS), a training-free inference-time scheduler for long-horizon video-to-video generation. It replaces left-to-right autoregressive rollout with a tree-structured sparse-to-dense process: a root call generates sparse anchors over the full horizon from dense conditioning, recursive refinement produces intermediate anchors, and leaf spans are synthesized between anchors. This reduces the critical path from K sequential steps to L+1 hierarchical steps and converts compounding drift into anchor-bounded drift. The method targets the static-camera V2V regime on base models such as Wan 2.1 + VACE and is evaluated across five conditioning modalities (inpainting, outpainting, edge, pose, depth), reporting outperformance versus two autoregressive baselines in quality and drift metrics plus stable generation of at least 40 minutes on LTX-2.3.

Significance. If the central results hold, ATS would represent a practical advance for extending video generation horizons without retraining or distillation. The approach is parameter-free, inference-only, and structurally converts the drift problem into one of anchor quality; these are clear strengths. The demonstration of multi-modality applicability and long-horizon stability (40+ minutes) could influence inference scheduling practices in video synthesis if the anchor-bounding claim is substantiated.

major comments (2)

[Abstract and §3 (Method)] The load-bearing claim that ATS converts 'horizon-compounding drift into anchor-bounded drift' rests on the unmodified base model producing faithful sparse root anchors from dense conditioning signals at arbitrary intervals. No experiment or table reports anchor quality (e.g., frame-wise fidelity or perceptual metrics) as a function of horizon distance or versus a dense autoregressive rollout at the same anchor points; without this, root-level errors would propagate through the tree rather than being bounded.
[§4 (Experiments)] The reported outperformance on quality and drift metrics across five modalities and two baselines is presented without separate ablation of root-anchor accuracy versus increasing temporal spacing. This measurement is required to substantiate that the hierarchical structure bounds drift rather than inheriting and amplifying errors from the initial sparse set.

minor comments (2)

[§3] The notation for the tree depth parameter L and the relationship to the original horizon K could be introduced with a small diagram or explicit equation in the main text for clarity.
[Figures] Figure captions for the qualitative results could include the exact conditioning modality and horizon length for each example to aid direct comparison with the quantitative tables.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below with clarifications on our design choices in the static-camera regime and commit to revisions that will strengthen the empirical support for the anchor-bounded drift claim.

Referee: [Abstract and §3 (Method)] The load-bearing claim that ATS converts 'horizon-compounding drift into anchor-bounded drift' rests on the unmodified base model producing faithful sparse root anchors from dense conditioning signals at arbitrary intervals. No experiment or table reports anchor quality (e.g., frame-wise fidelity or perceptual metrics) as a function of horizon distance or versus a dense autoregressive rollout at the same anchor points; without this, root-level errors would propagate through the tree rather than being bounded.

Authors: We agree that direct measurements of root-anchor fidelity would make the central claim more robust. The manuscript emphasizes the static-camera V2V regime precisely because dense conditioning signals (inpainting masks, pose, depth, etc.) supply per-frame information that allows the base model to synthesize consistent sparse anchors without future-frame leakage. Nevertheless, we will add a new ablation in the revised §3 (and supplementary material) reporting frame-wise fidelity (PSNR, LPIPS) and perceptual metrics for root anchors at increasing temporal spacings, directly compared against dense autoregressive generation at the same positions. This will quantify whether root-level errors remain bounded under our conditioning assumptions.

revision: yes

Referee: [§4 (Experiments)] The reported outperformance on quality and drift metrics across five modalities and two baselines is presented without separate ablation of root-anchor accuracy versus increasing temporal spacing. This measurement is required to substantiate that the hierarchical structure bounds drift rather than inheriting and amplifying errors from the initial sparse set.

Authors: We accept that a dedicated ablation isolating root-anchor accuracy from overall performance would better demonstrate that drift reduction arises from the tree hierarchy rather than from unusually strong initial anchors. In the revised manuscript we will expand §4 with a new subsection that varies root-anchor spacing across the five modalities, reports anchor-specific metrics alongside end-to-end quality and drift scores, and discusses any degradation observed at extreme spacings. This addition will directly address the concern about error inheritance versus bounding.

revision: yes
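The root-anchor ablation discussed in these responses could start from a per-anchor fidelity profile along the horizon. The sketch below computes PSNR only, using the standard definition 10·log10(MAX² / MSE). It assumes that a paired reference frame exists for every anchor position and that frames are flat lists of values in [0, `max_value`]. The function names are hypothetical, and LPIPS is omitted because it requires a learned network.

```python
import math


def psnr(reference, candidate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat, equally sized frames."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("frames must be non-empty and equally sized")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


def anchor_psnr_profile(reference_frames, generated_anchors, max_value=1.0):
    """Return (frame_index, psnr_db) pairs ordered by position on the horizon.

    generated_anchors maps a frame index to the generated anchor frame;
    reference_frames is indexable by the same frame indices.
    """
    return [
        (index, psnr(reference_frames[index], frame, max_value))
        for index, frame in sorted(generated_anchors.items())
    ]
```

Plotting the second element of each pair against the first would show whether anchor fidelity falls off with distance along the horizon, which is the dependence the referee asks the authors to report.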

Circularity Check

0 steps flagged

No circularity: ATS is a structural inference-time scheduler

The paper presents Anchored Tree Sampling as a training-free inference-time method that replaces sequential rollout with a sparse-to-dense tree hierarchy of anchors. The reduction from K sequential steps to L+1 hierarchical steps and the shift from horizon-compounding to anchor-bounded drift are direct consequences of the tree organization explicitly defined in the method description. The static-camera regime is stated as an applicability precondition under which the unmodified base model can synthesize sparse anchors from dense conditioning, rather than a quantity derived from fitted parameters or self-referential equations. No load-bearing self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the derivation; the central claims are structural and evaluated empirically against external baselines. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the base model can synthesize usable sparse anchors from dense conditioning in the static-camera setting; no free parameters or invented entities are introduced in the abstract description.

axioms (1)

domain assumption: Sparse anchors over the horizon are well approximated by the dense conditioning signal in the static-camera regime.
Explicitly stated as the regime in which the method is applied without retraining.


[1] [1]

From slow bidirectional to fast autoregressive video diffusion models,

T. Yin, Q. Zhang, R. Zhang, W. T. Freeman, F. Durand, E. Shechtman, and X. Huang, “From slow bidirectional to fast autoregressive video diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22963–22974, 2025

work page 2025

[2] [2]

Self forcing: Bridging the train-test gap in autoregressive video diffusion,

X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman, “Self forcing: Bridging the train-test gap in autoregressive video diffusion,”Advances in Neural Information Processing Systems, vol. 38, pp. 167283– 167308, 2026

work page 2026

[3] [3]

Self-forcing++: Towards minute-scale high-quality video generation,

J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y. Ban, and C.-J. Hsieh, “Self-forcing++: Towards minute-scale high-quality video generation,” inThe Fourteenth International Conference on Learning Representations, 2025

work page 2025

[4] [4]

LongLive: Real-time Interactive Long Video Generation

S. Yang, W. Huang, R. Chu, Y. Xiao, Y. Zhao, X. Wang, M. Li, E. Xie, Y. Chen, Y. Lu,et al., “Longlive: Real-time interactive long video generation,”arXiv preprint arXiv:2509.22622, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Diffusion forcing: Next-token prediction meets full-sequence diffusion,

B. Chen, D. Martí Monsó, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann, “Diffusion forcing: Next-token prediction meets full-sequence diffusion,”Advances in Neural Information Processing Systems, vol. 37, pp. 24081–24125, 2024

work page 2024

[6] [6]

Vace: All-in-one video creation and editing,

Z. Jiang, Z. Han, C. Mao, J. Zhang, Y. Pan, and Y. Liu, “Vace: All-in-one video creation and editing,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17191–17202, 2025

work page 2025

[7] [7]

Adapting vace for real-time autoregressive video diffusion,

R. Fosdick, “Adapting vace for real-time autoregressive video diffusion,”arXiv preprint arXiv:2602.14381, 2026

work page arXiv 2026

[8] [8]

Long context tuning for video generation,

Y. Guo, C. Yang, Z. Yang, Z. Ma, Z. Lin, Z. Yang, D. Lin, and L. Jiang, “Long context tuning for video generation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17281–17291, 2025

work page 2025

[9] [9]

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

H. Zhu, M. Zhao, G. He, H. Su, C. Li, and J. Zhu, “Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation,”arXiv preprint arXiv:2602.02214, 2026

work page internal anchor Pith review arXiv 2026

[10] [10]

Context forcing: Consistent autoregressive video generation with long context,

S. Chen, C. Wei, S. Sun, P. Nie, K. Zhou, G. Zhang, M.-H. Yang, and W. Chen, “Context forcing: Consistent autoregressive video generation with long context,”arXiv preprint arXiv:2602.06028, 2026

work page arXiv 2026

[11] [11]

Relax forcing: Relaxed kv-memory for consistent long video generation, 2026

Z. Zhao, Y. Lu, Z. Liu, J. Song, J. Deng, and I. Patras, “Relax forcing: Relaxed kv-memory for consistent long video generation,”arXiv preprint arXiv:2603.21366, 2026

work page arXiv 2026

[12] [12]

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Y. Gu, w. Mao, and M. Z. Shou, “Long-context autoregressive video modeling with next-frame prediction,” arXiv preprint arXiv:2503.19325, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] S. Ji, X. Chen, S. Yang, X. Tao, P. Wan, and H. Zhao, “Memflow: Flowing adaptive memory for consistent and efficient long video narratives,” arXiv preprint arXiv:2512.14699, 2025.

[14] Y. Lu, Y. Zeng, H. Li, H. Ouyang, Q. Wang, K. L. Cheng, J. Zhu, H. Cao, Z. Zhang, X. Zhu, et al., “Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation,” arXiv preprint arXiv:2512.04678, 2025.

[15] A. Zeng, C. Yang, C. Ge, E. Zhang, G. Xu, G. Lin, G. Gu, J. Pi, L. Li, M. Shi, et al., “Lpm 1.0: Video-based character performance model,” arXiv preprint arXiv:2604.07823, 2026.

[16] M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer, “Mask-predict: Parallel decoding of conditional masked language models,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6112–6121, 2019.

[17] M. Stern, W. Chan, J. Kiros, and J. Uszkoreit, “Insertion transformer: Flexible sequence generation via insertion operations,” in International Conference on Machine Learning, pp. 5976–5985, PMLR, 2019.

[18] J. Gu, C. Wang, and J. Zhao, “Levenshtein transformer,” Advances in Neural Information Processing Systems, vol. 32, 2019.

[19] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “Maskgit: Masked generative image transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11315–11325, 2022.

[20] N. Savinov, J. Chung, M. Binkowski, E. Elsen, and A. v. d. Oord, “Step-unrolled denoising autoencoders for text generation,” arXiv preprint arXiv:2112.06749, 2021.

[21] J. Ouyang, W. Teng, G. Chen, Y. Zhao, and H. Chen, “Dcarl: A divide-and-conquer framework for autoregressive long-trajectory video generation,” 2026.

[22] Y. HaCohen, B. Brazowski, N. Chiprut, Y. Bitterman, A. Kvochko, A. Berkowitz, D. Shalem, D. Lifschitz, D. Moshe, E. Porat, et al., “Ltx-2: Efficient joint audio-visual foundation model,” arXiv preprint arXiv:2601.03233, 2026.

[23] L. Zhang, S. Cai, M. Li, G. Wetzstein, and M. Agrawala, “Frame context packing and drift prevention in next-frame-prediction video diffusion models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[24] F. Reda, J. Kontkanen, E. Tabellion, D. Sun, C. Pantofaru, and B. Curless, “Film: Frame interpolation for large motion,” in European Conference on Computer Vision, pp. 250–266, Springer, 2022.

[25] Z. Huang, T. Zhang, W. Heng, B. Shi, and S. Zhou, “Real-time intermediate flow estimation for video frame interpolation,” in European Conference on Computer Vision, pp. 624–642, Springer, 2022.

[26] D. Danier, F. Zhang, and D. Bull, “Ldmvfi: Video frame interpolation with latent diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 1472–1480, 2024.

[27] S. Yin, C. Wu, H. Yang, J. Wang, X. Wang, M. Ni, Z. Yang, L. Li, S. Liu, F. Yang, et al., “Nuwa-xl: Diffusion over diffusion for extremely long video generation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1309–1320, 2023.

[28] R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan, “Phenaki: Variable length video generation from open domain textual descriptions,” in International Conference on Learning Representations, 2023.

[29] O. Bar-Tal, H. Chefer, O. Tov, C. Herrmann, R. Paiss, S. Zada, A. Ephrat, J. Hur, G. Liu, A. Raj, Y. Li, M. Rubinstein, T. Michaeli, O. Wang, D. Sun, T. Dekel, and I. Mosseri, “Lumiere: A space-time diffusion model for video generation,” 2024.

[30] H. Qiu, M. Xia, Y. Zhang, Y. He, X. Wang, Y. Shan, and Z. Liu, “Freenoise: Tuning-free longer video diffusion via noise rescheduling,” in International Conference on Learning Representations, vol. 2024, pp. 5260–5274, 2024.

[31] F.-Y. Wang, W. Chen, G. Song, H.-J. Ye, Y. Liu, and H. Li, “Gen-l-video: Multi-text to long video generation via temporal co-denoising,” 2023.

[32] R. Henschel, L. Khachatryan, H. Poghosyan, D. Hayrapetyan, V. Tadevosyan, Z. Wang, S. Navasardyan, and H. Shi, “Streamingt2v: Consistent, dynamic, and extendable long video generation from text,” in Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 2568–2577, 2025.

[33] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 11809–11822, 2023.

[34] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv:2408.03314, 2024.

[35]

Wan: Open and advanced large-scale video generative models,

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

work page 2025

[36] H. Lin, S. Chen, J. H. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang, “Depth anything 3: Recovering the visual space from any views,” arXiv preprint arXiv:2511.10647, 2025.

[37] J. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 6, pp. 679–698, 1986.

[38] Z. Yang, A. Zeng, C. Yuan, and Y. Li, “Effective whole-body pose estimation with two-stages distillation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp. 4210–4220, 2023.

[39] Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu, “VBench: Comprehensive benchmark suite for video generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[40] T. Seedance, D. Chen, L. Chen, X. Chen, Y. Chen, Z. Chen, Z. Chen, F. Cheng, T. Cheng, Y. Cheng, et al., “Seedance 2.0: Advancing video generation for world complexity,” arXiv preprint arXiv:2604.14148, 2026.

[41] K. Nan, R. Xie, P. Zhou, T. Fan, Z. Yang, Z. Chen, X. Li, J. Yang, and Y. Tai, “Openvid-1m: A large-scale high-quality dataset for text-to-video generation,” in International Conference on Learning Representations, vol. 2025, pp. 1045–1064, 2025.

[42] W. Wu, M. Liu, Z. Zhu, X. Xia, H. Feng, W. Wang, K. Q. Lin, C. Shen, and M. Z. Shou, “Moviebench: A hierarchical movie level dataset for long video generation,” 2025.

A Additional figures

A.1 Full-duration reels

Figure 8: Long-form edge-conditioned generation on Wan2.1 + VACE: AR vs. ATS on the same checkpoints.

Figure 9: Long-form depth-conditioned gene...