Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
Pith reviewed 2026-05-21 07:14 UTC · model grok-4.3
The pith
Anchored Tree Sampling replaces sequential video rollout with a hierarchy of sparse anchors to bound drift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Anchored Tree Sampling is a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from K sequential rollout steps to L+1 tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. The method focuses on the static-camera regime where sparse anchors over the horizon are well approximated by the dense conditioning signal so that the base model can produce them without retraining.
What carries the argument
Anchored Tree Sampling, a scheduler that organizes generation as root anchors, recursive refinement, and leaf spans to confine drift between nearby references.
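The scheduling idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Call`, `split`, and `build_schedule`, the `fanout` and `leaf_len` parameters, and the even anchor spacing are all assumptions introduced here. The sketch only plans which span each call covers; the base-model call itself is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    level: int  # 0 is the root; larger values are deeper in the tree
    start: int  # index of the left bounding frame
    end: int    # index of the right bounding frame
    kind: str   # "root", "refine", or "leaf"


def split(start, end, parts):
    """Cut [start, end] into sub-spans that share their endpoint frames."""
    parts = min(parts, end - start)  # avoid zero-length sub-spans
    cuts = [start + (end - start) * i // parts for i in range(parts + 1)]
    return list(zip(cuts[:-1], cuts[1:]))


def build_schedule(num_frames, fanout, leaf_len):
    """Plan model calls level by level (hypothetical layout, not the paper's)."""
    if num_frames < 2 or fanout < 2 or leaf_len < 1:
        raise ValueError("need num_frames >= 2, fanout >= 2, leaf_len >= 1")
    # Root call: sparse anchors spread over the full horizon.
    levels = [[Call(0, 0, num_frames - 1, "root")]]
    spans = split(0, num_frames - 1, fanout)
    while spans:
        calls, next_spans = [], []
        for start, end in spans:
            if end - start <= leaf_len:
                # Short enough: synthesize every frame between the two anchors.
                calls.append(Call(len(levels), start, end, "leaf"))
            else:
                # Too long: generate intermediate anchors, then recurse.
                calls.append(Call(len(levels), start, end, "refine"))
                next_spans.extend(split(start, end, fanout))
        levels.append(calls)
        spans = next_spans
    return levels
```

For an 81-frame horizon, `build_schedule(81, 4, 5)` yields three levels: 1 root call, 4 refinement calls, and 16 leaf calls. Assuming calls within a level can run in parallel, the critical path is 3 steps (L+1 with L=2), against 16 sequential steps if the same leaf spans were generated left to right.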
If this is right
- ATS outperforms two contemporary autoregressive baselines on Wan 2.1 plus VACE across five conditioning modalities in overall quality and drift prevention.
- The method supports stable generation of at least 40 minutes on LTX-2.3 across the same modalities.
- The critical path length drops from K sequential steps to L plus one tree-hierarchical steps.
- The paper proposes extending the approach to arbitrarily long text-to-video generation and to dynamic-camera and multi-shot regimes.
Where Pith is reading between the lines
- The tree structure could reduce compounding errors in other autoregressive domains such as audio or long text sequences.
- Pairing ATS with existing distillation techniques might further improve anchor quality without changing the inference schedule.
- Explicit modeling of camera motion would be needed before the same anchoring logic applies reliably outside the static-camera case.
Load-bearing premise
Sparse anchors over the full horizon are well approximated by the dense conditioning signal so the base model can produce them without retraining.
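A toy error model makes the dependence on this premise concrete. It is purely illustrative and not taken from the paper: it assumes each model call adds a fixed error `step_err` on top of whatever it conditions on, and that root anchors carry their own error `root_err`. The function names and the additive model are hypothetical.

```python
def rollout_error(num_chunks, step_err):
    """Left-to-right rollout: chunk k inherits the error of every earlier chunk."""
    return [step_err * (k + 1) for k in range(num_chunks)]


def tree_error(num_chunks, step_err, depth, root_err):
    """Tree schedule: each leaf chunk sits `depth` calls below the root anchors."""
    return [root_err + step_err * depth for _ in range(num_chunks)]
```

With 16 chunks and `step_err=0.01`, the last rollout chunk accumulates 0.16, while `tree_error(16, 0.01, depth=2, root_err=0.01)` stays at 0.03 for every chunk. Raising `root_err` to 0.5 puts every chunk at 0.52, which is the failure mode the premise rules out: under this toy model the tree bounds drift only as far as the root anchors are accurate.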
What would settle it
Generate a long sequence where the initial sparse anchor frames visibly mismatch the conditioning signals at distant time points and measure whether continuity collapses between those anchors.
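One simple way to quantify "continuity collapses between those anchors" is to look at the largest frame-to-frame jump inside each inter-anchor span. The sketch below is a hypothetical proxy, not a metric from the paper; frames are represented as flat lists of pixel values, and `frame_gap` and `max_jump_per_span` are names introduced here for illustration.

```python
def frame_gap(a, b):
    """Mean absolute per-pixel difference between two equally sized flat frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def max_jump_per_span(frames, anchor_idx):
    """Largest consecutive-frame gap inside each span between neighboring anchors."""
    anchors = sorted(set(anchor_idx))
    jumps = []
    for left, right in zip(anchors, anchors[1:]):
        gaps = [frame_gap(frames[t - 1], frames[t]) for t in range(left + 1, right + 1)]
        jumps.append(max(gaps))
    return jumps
```

A span whose maximum jump is far larger than its neighbors' would point to a visible discontinuity, for example where generated content has to snap onto a mismatched anchor.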
Original abstract
Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce Anchored Tree Sampling (ATS): a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from K sequential rollout steps to L+1 tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the static-camera regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable ≥40-minute generation on LTX-2.3 across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Anchored Tree Sampling (ATS), a training-free inference-time scheduler for long-horizon video-to-video generation. It replaces left-to-right autoregressive rollout with a tree-structured sparse-to-dense process: a root call generates sparse anchors over the full horizon from dense conditioning, recursive refinement produces intermediate anchors, and leaf spans are synthesized between anchors. This reduces the critical path from K sequential steps to L+1 hierarchical steps and converts compounding drift into anchor-bounded drift. The method targets the static-camera V2V regime and is evaluated on Wan 2.1 + VACE across five conditioning modalities (inpainting, outpainting, edge, pose, depth), where it reports gains over two autoregressive baselines in quality and drift metrics, plus stable generation of at least 40 minutes on LTX-2.3.
Significance. If the central results hold, ATS would represent a practical advance for extending video generation horizons without retraining or distillation. The approach is parameter-free, inference-only, and structurally converts the drift problem into one of anchor quality; these are clear strengths. The demonstration of multi-modality applicability and long-horizon stability (40+ minutes) could influence inference scheduling practices in video synthesis if the anchor-bounding claim is substantiated.
major comments (2)
- [Abstract and §3] Abstract and §3 (Method): The load-bearing claim that ATS converts 'horizon-compounding drift into anchor-bounded drift' rests on the unmodified base model producing faithful sparse root anchors from dense conditioning signals at arbitrary intervals. No experiment or table reports anchor quality (e.g., frame-wise fidelity or perceptual metrics) as a function of horizon distance or versus a dense autoregressive rollout at the same anchor points; without this, root-level errors would propagate through the tree rather than being bounded.
- [§4] §4 (Experiments): The reported outperformance on quality and drift metrics across five modalities and two baselines is presented without separate ablation of root-anchor accuracy versus increasing temporal spacing. This measurement is required to substantiate that the hierarchical structure bounds drift rather than inheriting and amplifying errors from the initial sparse set.
minor comments (2)
- [§3] The notation for the tree depth parameter L and the relationship to the original horizon K could be introduced with a small diagram or explicit equation in the main text for clarity.
- [Figures] Figure captions for the qualitative results could include the exact conditioning modality and horizon length for each example to aid direct comparison with the quantitative tables.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address each major comment below with clarifications on our design choices in the static-camera regime and commit to revisions that will strengthen the empirical support for the anchor-bounded drift claim.
Point-by-point responses
- Referee: [Abstract and §3] Abstract and §3 (Method): The load-bearing claim that ATS converts 'horizon-compounding drift into anchor-bounded drift' rests on the unmodified base model producing faithful sparse root anchors from dense conditioning signals at arbitrary intervals. No experiment or table reports anchor quality (e.g., frame-wise fidelity or perceptual metrics) as a function of horizon distance or versus a dense autoregressive rollout at the same anchor points; without this, root-level errors would propagate through the tree rather than being bounded.
  Authors: We agree that direct measurements of root-anchor fidelity would make the central claim more robust. The manuscript emphasizes the static-camera V2V regime precisely because dense conditioning signals (inpainting masks, pose, depth, etc.) supply per-frame information that allows the base model to synthesize consistent sparse anchors without future-frame leakage. Nevertheless, we will add a new ablation in the revised §3 (and supplementary material) reporting frame-wise fidelity (PSNR, LPIPS) and perceptual metrics for root anchors at increasing temporal spacings, directly compared against dense autoregressive generation at the same positions. This will quantify whether root-level errors remain bounded under our conditioning assumptions.
  revision: yes
- Referee: [§4] §4 (Experiments): The reported outperformance on quality and drift metrics across five modalities and two baselines is presented without separate ablation of root-anchor accuracy versus increasing temporal spacing. This measurement is required to substantiate that the hierarchical structure bounds drift rather than inheriting and amplifying errors from the initial sparse set.
  Authors: We accept that a dedicated ablation isolating root-anchor accuracy from overall performance would better demonstrate that drift reduction arises from the tree hierarchy rather than from unusually strong initial anchors. In the revised manuscript we will expand §4 with a new subsection that varies root-anchor spacing across the five modalities, reports anchor-specific metrics alongside end-to-end quality and drift scores, and discusses any degradation observed at extreme spacings. This addition will directly address the concern about error inheritance versus bounding.
  revision: yes
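The root-anchor ablation described in these responses could be tabulated along the following lines. This is a sketch under stated assumptions: frames are flat lists of 8-bit pixel values, the choice of reference frame (for instance, the dense autoregressive output at the same position) is left to the caller, and `mean_psnr_by_spacing` is a hypothetical helper rather than anything from the paper. Only PSNR is shown; LPIPS would need a learned model and is omitted.

```python
import math
from collections import defaultdict


def psnr(frame_a, frame_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat frames."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_psnr_by_spacing(records):
    """Average PSNR per anchor spacing.

    `records` is an iterable of (spacing, anchor_frame, reference_frame).
    """
    buckets = defaultdict(list)
    for spacing, anchor, reference in records:
        buckets[spacing].append(psnr(anchor, reference))
    return {s: sum(v) / len(v) for s, v in sorted(buckets.items())}
```

If mean PSNR stays roughly flat as spacing grows, that would support the anchor-bounded reading; a steady decline would suggest that root-level errors grow with the horizon.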
Circularity Check
No circularity: ATS is a structural inference-time scheduler
Full rationale
The paper presents Anchored Tree Sampling as a training-free inference-time method that replaces sequential rollout with a sparse-to-dense tree hierarchy of anchors. The reduction from K sequential steps to L+1 hierarchical steps and the shift from horizon-compounding to anchor-bounded drift are direct consequences of the tree organization explicitly defined in the method description. The static-camera regime is stated as an applicability precondition under which the unmodified base model can synthesize sparse anchors from dense conditioning, rather than a quantity derived from fitted parameters or self-referential equations. No load-bearing self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the derivation; the central claims are structural and evaluated empirically against external baselines. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sparse anchors over the horizon are well approximated by the dense conditioning signal in the static-camera regime.