arxiv: 2606.18702 · v1 · pith:J7KJPPK6new · submitted 2026-06-17 · 💻 cs.CV

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

Pith reviewed 2026-06-26 21:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · autoregressive diffusion · bidirectional generation · temporal order · distillation · video extension · inbetween generation · causal VAE

The pith

One autoregressive video model generates in any temporal direction via bidirectional distillation and anchor latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to lift the forward-only limit of autoregressive video diffusion models so that a single network can generate forward, backward, or in between given frames. It does this by training the model with bidirectional distillation while adding blockwise anchor latents that supply missing past context at block edges when the causal VAE runs backward. A sympathetic reader would care because real video workflows rarely follow a strict forward stream; they often require extending a clip from future frames, filling gaps, or creating loops. Experiments indicate the resulting model matches forward-only baselines on short and long clips yet unlocks the extra generation modes.

Core claim

UniTemp trains one autoregressive student model that conditions on arbitrary past and future frames by using blockwise anchor latents to restore the context the causal 3D VAE withholds during backward passes, thereby supporting bidirectional extension, inbetween generation, and other flexible workflows at inference time while preserving competitive quality on standard video benchmarks.

What carries the argument

Blockwise anchor latents that restore missing past context at block boundaries during backward generation, used inside a bidirectional distillation framework that trains the single autoregressive model.

If this is right

  • The model conditions on future frames alone to extend video backward.
  • It fills frames between given past and future clips for inbetween generation.
  • It produces looping videos and handles scene transitions by mixing conditioning directions.
  • It supports visual story generation by sequencing clips in non-forward orders.
  • Performance on short and long forward video tasks stays comparable to forward-only baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchor-latent fix could be tested on other causal encoders used in audio or text sequence models.
  • A single trained checkpoint might replace multiple direction-specific models in video editing tools.
  • Interactive applications could change generation direction mid-clip without reloading weights.

Load-bearing premise

The causal 3D VAE produces inter-block discontinuities in backward generation that can be fixed by auxiliary anchor latents without hurting forward performance.

What would settle it

Run backward generation on the same model and sequences with the anchor latents removed and measure whether visible discontinuities or motion breaks appear at block boundaries.

Figures

Figures reproduced from arXiv: 2606.18702 by Jinhong Lin, Jiuxiang Gu, Krishna Kumar Singh, Lin Zhang, Sicheng Mo, Yin Li, Yuheng Li, Zefan Cai, Zihao Lin.

Figure 1. We present UniTemp, a unified distillation framework that delivers a single model capable of flexibly generating video conditioned on past context, future context, or both, and supporting a wide range of generation tasks.
  … 43]. Despite their impressive performance, these models require multiple denoising steps with full-sequence attention at inference, making them computationally expensive and difficult to…
Figure 2. Visualization of inter-block flickering in backward generation. Without anchor latents, visible discontinuities appear at block boundaries.
  … temporal smoothness, thus lower FR. Inter-block latents can only attend in one direction through cached keys and values, and therefore should demonstrate higher FR. We validate this empirically in Tab. 1, where both forward and backward generation show inter-block FRs…
Figure 3. Left: Causal design of the frozen 3D VAE. It encodes video into spatial-temporal latents (V) with a leading image latent (I). Each latent is dependent on its past context. Right: Overview of UniTemp. We distill a teacher model into a unified autoregressive student G_θ trained on its self-rollout in both forward and backward directions. In backward generation, we introduce blockwise anchor latents (dashed …
Figure 4. Long video generation results. Past + future sink latents provide strong conditioning to reduce content variation over long durations.
  … Single-direction long video generation. Tab. 3 compares long video generation at 10s, 30s, and 100s. Existing training-based long video generation methods (LongLive [35], Rolling-Forcing [17]) achieve high temporal consistency but produce extremely low-dynamic content (36…
Figure 5. Visualization of inbetween video generation. Given the head (leftmost) and tail (rightmost) frames, UniTemp infills temporally coherent content.
  … as outputs in training and inference. When included as outputs in training, the loss is applied to the anchor latents. As discussed in Sec. 4.1, anchor latents themselves are generated without past context. Therefore, once included in the outputs at test time, the…
Figure 6. Looping video generation given the same head and tail frames. Panel labels: Head, Generated, Tail.
Figure 8. Attention masks for Stage-1 training. Attended latents are filled with green color. (a) Forward causal mask: each block attends to all previously generated blocks. (b) Baseline backward attention mask with block size B=3 and without anchor latents: each block attends to future blocks, while we introduce a dummy initial block (shown in blue) to resolve the image/video latent ambiguity. (c) Baseline backward…
Figure 9. Generation order and attended tokens in Stage-2 training.
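
The Figure 8 caption describes two mask shapes: in forward generation each block attends to previously generated blocks, and in the baseline backward case each block attends to future blocks. The sketch below builds both as boolean grids. It is a minimal illustration under stated assumptions, not the authors' code: the helper names `block_attention_mask` and `render` are hypothetical, latents inside one block are assumed to attend to each other, and the dummy initial block and the anchor latents from the caption are left out.

```python
def block_attention_mask(num_latents, block_size, backward=False):
    """Return mask[q][k] == True when query latent q may attend to key latent k.

    Assumption (hypothetical helper, not the paper's code): latents inside one
    block see each other, and a block also sees every block generated before it.
    Forward generation produces blocks left to right; backward generation
    produces them right to left, so "already generated" means later in time.
    """
    mask = []
    for q in range(num_latents):
        q_block = q // block_size
        row = []
        for k in range(num_latents):
            k_block = k // block_size
            visible = k_block >= q_block if backward else k_block <= q_block
            row.append(visible)
        mask.append(row)
    return mask


def render(mask):
    """Draw the mask as text: '#' marks an attended latent, '.' a masked one."""
    return "\n".join("".join("#" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    # Two blocks of three latents each, matching the caption's block size B=3.
    print(render(block_attention_mask(6, 3)))
    print()
    print(render(block_attention_mask(6, 3, backward=True)))
```

With six latents and a block size of 3, the forward grid is lower block-triangular and the backward grid is its mirror image, which is the asymmetry the anchor latents are meant to compensate for.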
Original abstract

Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to introduce UniTemp, a bidirectional distillation framework that trains a single autoregressive video diffusion model capable of generation in arbitrary temporal orders. It identifies the causal conditioning of the standard 3D VAE as the source of inter-block discontinuities in backward generation and proposes blockwise anchor latents to restore missing past context at boundaries. The resulting model is said to support bidirectional extension, inbetween generation, looping, scene transitions, and visual story generation while maintaining competitive performance on short and long video tasks relative to forward-only baselines.

Significance. If the central technical claim holds, the work would meaningfully expand the practical utility of autoregressive video models by removing the forward-only restriction, enabling new controllable workflows without requiring separate models per direction. The distillation approach for multi-directional capability and the anchor-latent mechanism for causal VAE compatibility are the primary potential contributions.

major comments (1)
  1. [Abstract] Abstract / Method description: the assertion that blockwise anchor latents 'restore the missing past context at block boundaries during backward generation' without side effects is load-bearing for all bidirectional and inbetween claims, yet the provided text supplies neither a quantitative discontinuity metric (e.g., boundary artifact scores before/after anchors) nor an ablation isolating the anchors' contribution. If the anchors only approximate rather than recover exact causal conditioning, the reported performance on looping and inbetween tasks would be undermined.
minor comments (1)
  1. [Abstract] Abstract: no error bars, dataset details, or specific quantitative results (FID, FVD, etc.) are reported to support the 'competitive performance' statement, making direct comparison to forward-only methods difficult to evaluate from the given text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment highlighting the need for stronger empirical support of the blockwise anchor latents. We address the point directly below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract / Method description: the assertion that blockwise anchor latents 'restore the missing past context at block boundaries during backward generation' without side effects is load-bearing for all bidirectional and inbetween claims, yet the provided text supplies neither a quantitative discontinuity metric (e.g., boundary artifact scores before/after anchors) nor an ablation isolating the anchors' contribution. If the anchors only approximate rather than recover exact causal conditioning, the reported performance on looping and inbetween tasks would be undermined.

    Authors: We agree that the current manuscript does not include a dedicated quantitative discontinuity metric or an ablation isolating the anchors. The presented evidence consists of overall task metrics (FVD, CLIP similarity) on bidirectional and inbetween generation plus qualitative examples. In the revised version we will add (1) a boundary artifact score defined as the average L2 distance in VAE latent space (and optionally LPIPS in pixel space) across block boundaries for backward generation with vs. without anchors, and (2) an ablation table reporting performance on looping and inbetween tasks when the anchor mechanism is removed. These additions will directly test whether the anchors recover sufficient causal context or merely approximate it. revision: yes
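
The boundary artifact score proposed in the rebuttal (average L2 distance in latent space across block boundaries) can be sketched as follows. This is one plausible reading, not the authors' definition: it assumes each latent has been flattened to a list of floats, that "across a block boundary" means the last latent of one block paired with the first latent of the next, and the helper names are hypothetical. The intra-block average is an added reference level, not something the rebuttal specifies.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two flattened latents of equal length."""
    if len(a) != len(b):
        raise ValueError("latents must have the same number of elements")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_adjacent_distance(latents, block_size, at_boundaries):
    """Average L2 distance between temporally adjacent latents.

    at_boundaries=True keeps only pairs that straddle a block boundary;
    at_boundaries=False keeps only pairs inside one block (a reference level).
    """
    distances = []
    for t in range(1, len(latents)):
        is_boundary = t % block_size == 0
        if is_boundary == at_boundaries:
            distances.append(l2_distance(latents[t - 1], latents[t]))
    if not distances:
        raise ValueError("no adjacent pairs of the requested kind")
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Toy sequence: smooth inside each block, with a jump between the blocks.
    toy = [[0.0], [1.0], [2.0], [5.0], [6.0], [7.0]]
    print(mean_adjacent_distance(toy, 3, at_boundaries=True))
    print(mean_adjacent_distance(toy, 3, at_boundaries=False))
```

Computing the boundary average on the same sequences generated with and without anchor latents, and comparing each against the intra-block average, would give the before/after comparison the referee asks for.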

Circularity Check

0 steps flagged

No circularity; derivation self-contained with no reductions to inputs

full rationale

The abstract and description introduce blockwise anchor latents and bidirectional distillation as new technical components to address causal VAE discontinuities, but contain no equations, no fitted parameters renamed as predictions, and no self-citations invoked as load-bearing uniqueness theorems. Claims of arbitrary-order generation rest on the introduced design rather than tautological redefinitions or self-referential fits. This is the normal case of an externally verifiable engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review is based solely on the abstract; the central technical premise is the causal nature of the 3D VAE and the need for auxiliary latents to restore context.

axioms (1)
  • domain assumption: Causal 3D VAE encodes latents strictly conditioned on past context
    Stated as the widely used structure in video diffusion models that creates the backward-generation problem.
invented entities (1)
  • blockwise anchor latents (no independent evidence)
    purpose: restore the missing past context at block boundaries during backward generation
    Introduced to address inter-block discontinuities caused by the causal VAE.
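
The domain assumption in the ledger, that each latent depends only on past context, can be illustrated with a toy one-dimensional causal filter. This is a stand-in chosen for illustration and not the paper's 3D VAE; `causal_temporal_filter` and its kernel are hypothetical, and frames are reduced to single floats.

```python
def causal_temporal_filter(frames, kernel):
    """Toy causal encoder: out[t] = sum_k kernel[k] * frames[t - k].

    Frames before the start of the clip are treated as zeros, so out[t]
    never reads a frame later than t.
    """
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:
                acc += weight * frames[t - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    kernel = [0.5, 0.5]
    clip = [1.0, 2.0, 3.0, 4.0]
    base = causal_temporal_filter(clip, kernel)

    # Editing the last frame leaves every earlier latent untouched.
    future_edit = causal_temporal_filter([1.0, 2.0, 3.0, 100.0], kernel)
    print(base[:3] == future_edit[:3])

    # Editing the first frame changes the latents that follow it.
    past_edit = causal_temporal_filter([9.0, 2.0, 3.0, 4.0], kernel)
    print(base[1] == past_edit[1])
```

In this toy, a latent can only be computed correctly once the frames before it exist. A backward pass produces later frames first, which is the gap in past context that the anchor latents are introduced to fill.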

pith-pipeline@v0.9.1-grok · 5812 in / 1151 out tokens · 30959 ms · 2026-06-26T21:30:14.044939+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 11 linked inside Pith

  1. [1]

    In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorber, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable video diffusion: Scaling latent video diffusion models to large datasets. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)

  2. [2]

    arXiv preprint arXiv:2311.15127 (2023)

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  3. [3]

    OpenAI Technical Report (2024)

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators. OpenAI Technical Report (2024)

  4. [4]

    Chen, J., Fu, Z., He, X.: Infinite-forcing: Towards infinite-long video generation (2025), https://github.com/SOTAMak1r/Infinite-Forcing

  5. [5]

    arXiv preprint arXiv:2510.02283 (2025)

    Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)

  6. [6]

    In: AAAI (2024)

    Danier, D., Zhang, F., Bull, D.: Ldmvfi: Video frame interpolation with latent diffusion models. In: AAAI (2024)

  7. [7]

    arXiv preprint arXiv:2403.14611 (2024)

    Feng, H., Ding, Z., Xia, Z., Niklaus, S., Abrevaya, V., Black, M.J., Zhang, X.: Explorative inbetweening of time and space. arXiv preprint arXiv:2403.14611 (2024)

  8. [8]

    In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)

    Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion features for consistent video editing. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)

  9. [9]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2014)

    Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)

  10. [10]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

    Henschel, R., Khachatryan, L., Poghosyan, H., Hayrapetyan, D., Tadevosyan, V., Wang, Z., Navasardyan, S., Shi, H.: Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  11. [11]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

  12. [12]

    In: Transactions on Machine Learning Research (TMLR) (2022)

    Höppe, T., Mehrjou, A., Bauer, S., Nielsen, D., Dittadi, A.: Diffusion models for video prediction and infilling. In: Transactions on Machine Learning Research (TMLR) (2022)

  13. [13]

    arXiv preprint arXiv:2506.08009 (2025)

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self-forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  15. [15]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

    Jiang, Z., Han, Z., et al.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

  16. [16]

    arXiv preprint arXiv:2412.03603 (2024)

    Kong, W., Tian, Q., Zhang, Z., Min, R., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  17. [17]

    arXiv preprint arXiv:2509.25161 (2025)

    Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025)

  18. [18]

    arXiv preprint arXiv:2512.04678 (2025)

    Lu, Y., Zeng, Y., Li, H., Ouyang, H., Wang, Q., Cheng, K.L., Zhu, J., Cao, H., Zhang, Z., Zhu, X., Shen, Y., Zhang, M.: Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678 (2025)

  19. [19]

    arXiv preprint arXiv:2501.03575 (2025)

    NVIDIA: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

  20. [20]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

  21. [22]

    arXiv preprint arXiv:2410.13720 (2024)

    Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.Y., Chuang, C.Y., et al.: Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024)

  22. [23]

    In: Proceedings of the European Conference on Computer Vision (ECCV)

    Reda, F., Kontkanen, J., Tabellion, E., Sun, D., Pantofaru, C., Curless, B.: FILM: Frame interpolation for large motion. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 250–266 (2022)

  23. [24]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  24. [25]

    In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015)

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015)

  25. [26]

    Neurocomputing 568, 127063 (2024)

    Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)

  26. [27]

    arXiv preprint arXiv:2510.08561 (2025)

    Tanveer, M., Zhou, Y., Niklaus, S., Amiri, A.M., Zhang, H., Singh, K.K., Zhao, N.: Multicoin: Multi-modal controllable video inbetweening. arXiv preprint arXiv:2510.08561 (2025)

  27. [28]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

  28. [29]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

    Voleti, V., Jolicoeur-Martineau, A., Pal, C.: Mcvd: Masked conditional video diffusion for prediction, generation, and interpolation. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

  29. [30]

    Wan-AI: Wan2.1: Text-to-video generation model. https://github.com/Wan-AI/Wan2.1 (2024)

  30. [31]

    In: NeurIPS Datasets and Benchmarks (2024)

    Wang, W., Yang, Y.: Vidprom: A million-scale real-world video prompt-gallery dataset for text-to-video diffusion models. In: NeurIPS Datasets and Benchmarks (2024)

  31. [32]

    In: Proceedings of the International Conference on Learning Representations (ICLR) (2025)

    Wang, X., Zhou, B., Curless, B., Kemelmacher-Shlizerman, I., Holynski, A., Seitz, S.M.: Generative inbetweening: Adapting image-to-video models for keyframe interpolation. In: Proceedings of the International Conference on Learning Representations (ICLR) (2025)

  32. [33]

    In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)

    Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)

  33. [34]

    arXiv preprint arXiv:2412.15115 (2024)

    Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)

  34. [35]

    arXiv preprint arXiv:2509.22622 (2025)

    Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., Han, S., Chen, Y.: Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622 (2025)

  35. [36]

    In: Proceedings of the International Conference on Learning Representations (ICLR) (2025)

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. In: Proceedings of the International Conference on Learning Representations (ICLR) (2025)

  36. [37]

    arXiv preprint arXiv:2511.20649 (2025)

    Yesiltepe, H., Meral, T.H.S., Akan, A.K., Oktay, K., Yanardag, P.: Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649 (2025)

  37. [38]

    arXiv preprint arXiv:2512.05081 (2025)

    Yi, J., Jang, W., Cho, P.H., Nam, J., Yoon, H., Kim, S.: Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081 (2025)

  38. [39]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2024)

    Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T.: Improved distribution matching distillation for fast image synthesis. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)

  39. [40]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  40. [41]

    arXiv preprint arXiv:2412.07772 (2024)

    Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast causal video generators. arXiv preprint arXiv:2412.07772 (2024)

  41. [42]

    In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)

    Yu, L., Lezama, J., Gundavarapu, N.B., Versari, L., Sohn, K., Minnen, D., Cheng, Y., Gupta, A., Gu, X., Hauptmann, A.G., et al.: Language model beats diffusion – tokenizer is key to visual generation. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)

  42. [43]

    arXiv preprint arXiv:2412.20404 (2024)

    Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., You, Y.: Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404 (2024)

  43. [44]

    With 6 latents in attention, RoPE can thus distinguish the two cases and allow the model to generate correctly

    to condition the generation of the first block (z18, z19, z20). With 6 latents in attention, RoPE can thus distinguish the two cases and allow the model to generate correctly. Loss is not applied on the dummy block. The noise level is sampled independently for the dummy block and the real initial block (z0, z1, z2). In stage-2 training, we also prepend a ...