pith. machine review for the scientific record.

arxiv: 2604.13509 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords video stylization · diffusion transformer · autoregressive model · real-time streaming · KV cache · distillation · text-guided stylization · reference-guided stylization

The pith

RTR-DiT distills a bidirectional Diffusion Transformer into a few-step autoregressive model that stylizes streaming video in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RTR-DiT as a streaming video stylization framework built on the Diffusion Transformer architecture. It first fine-tunes a bidirectional teacher model on a curated dataset that supports both text-guided and reference-guided tasks, then distills the teacher into an autoregressive student via Self Forcing and Distribution Matching Distillation to reduce denoising steps. A reference-preserving KV cache update strategy is proposed to keep outputs stable across long sequences and to permit instant switching between text prompts and reference images. This setup directly targets the instability, high compute cost, and multi-step inference that have prevented prior diffusion methods from handling extended videos in practical settings. A sympathetic reader would care because the result makes high-quality, consistent stylization feasible for live or interactive applications where earlier approaches produced drift or required too much time per frame.
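To make the pipeline concrete, the loop it implies looks roughly like the sketch below: each incoming chunk of frames is denoised in a few steps while attending to a KV cache built from earlier chunks plus a guidance embedding, and the chunk's own keys/values are appended before moving on. Everything here (model, shapes, step count) is an illustrative stand-in, not the authors' implementation.

```python
# Illustrative chunk-wise autoregressive stylization loop (not the authors' code).
from dataclasses import dataclass, field
import torch

@dataclass
class KVCache:
    keys: list = field(default_factory=list)     # one tensor per cached chunk
    values: list = field(default_factory=list)

class DummyFewStepDiT(torch.nn.Module):
    """Stand-in for the distilled student; a real model would cross-attend to the
    guidance embedding and self-attend to the cached chunks on every call."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def denoise_step(self, latents, cache: KVCache, guidance):
        return self.proj(latents)                # placeholder for one denoising step

def stylize_stream(chunks, model, guidance, num_steps: int = 4):
    """Process chunks one at a time, reusing cached context from earlier chunks."""
    cache, outputs = KVCache(), []
    for chunk in chunks:                         # chunk: (frames, tokens, dim)
        latents = torch.randn_like(chunk)        # each chunk starts from noise
        for _ in range(num_steps):               # few-step denoising
            latents = model.denoise_step(latents, cache, guidance)
        cache.keys.append(latents.detach())      # this chunk becomes context
        cache.values.append(latents.detach())
        outputs.append(latents)
    return torch.cat(outputs, dim=0)

if __name__ == "__main__":
    model = DummyFewStepDiT()
    chunks = [torch.randn(4, 16, 64) for _ in range(3)]  # 3 chunks of 4 frames
    guidance = torch.randn(1, 16, 64)                     # text or reference embedding
    print(stylize_stream(chunks, model, guidance).shape)  # torch.Size([12, 16, 64])
```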

Core claim

A bidirectional teacher Diffusion Transformer fine-tuned for video stylization can be distilled into a few-step autoregressive model; when equipped with a reference-preserving KV cache update strategy, the resulting system outperforms prior methods on quantitative metrics and visual quality for both text-guided and reference-guided stylization while delivering stable real-time performance on arbitrarily long videos and supporting interactive style switches without drift.

What carries the argument

The reference-preserving KV cache update strategy, which maintains temporal coherence in the autoregressive DiT while allowing real-time changes between text and reference guidance signals.
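The summary does not spell out the exact cache update rule, but one plausible reading is sketched below: keys/values derived from the guidance signal are pinned, while frame keys/values roll through a bounded window. Pinning keeps the conditioning from decaying over long sequences, and replacing the pinned entries gives an instant prompt or reference switch. Class and method names here are hypothetical.

```python
# One plausible reading of a "reference-preserving" KV cache (illustrative only).
from collections import deque
import torch

class ReferencePreservingKVCache:
    def __init__(self, max_frame_chunks: int = 8):
        self.ref_kv = None                              # pinned guidance keys/values
        self.frame_kv = deque(maxlen=max_frame_chunks)  # rolling frame context

    def set_guidance(self, ref_keys: torch.Tensor, ref_values: torch.Tensor):
        """Swap text/reference guidance instantly without touching frame history."""
        self.ref_kv = (ref_keys, ref_values)

    def append_frames(self, keys: torch.Tensor, values: torch.Tensor):
        """Add a new chunk's keys/values; the oldest chunk is evicted automatically."""
        self.frame_kv.append((keys, values))

    def gather(self):
        """Concatenate pinned guidance with the frame window along the token axis."""
        assert self.ref_kv is not None, "call set_guidance() first"
        ks = [self.ref_kv[0]] + [k for k, _ in self.frame_kv]
        vs = [self.ref_kv[1]] + [v for _, v in self.frame_kv]
        return torch.cat(ks, dim=1), torch.cat(vs, dim=1)  # (batch, tokens, dim)
```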

If this is right

  • Quantitative metrics and visual quality exceed those of existing text-guided and reference-guided video stylization methods.
  • Long video sequences can be processed stably at real-time speeds.
  • Text prompts and reference images can be switched interactively during generation without introducing artifacts.
  • The framework becomes suitable for immersive applications and artistic creation that require extended or live video output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar distillation and cache strategies could be tested on other autoregressive video generation tasks to reduce latency.
  • The approach may enable live video editing pipelines where style changes occur on the fly.
  • Extending the KV cache preservation to additional conditioning signals could support more complex multi-modal video control.

Load-bearing premise

That the reference-preserving KV cache update strategy combined with the distillation process maintains temporal consistency and visual quality over arbitrarily long videos without introducing artifacts or drift when switching between text and reference guidance.

What would settle it

Stylize a multi-minute video sequence that alternates between text prompts and reference images every few seconds, then inspect whether visual quality or frame-to-frame consistency degrades after thousands of frames.
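A minimal harness for that probe might look like the sketch below; `set_guidance` and `stylize_next_frame` are assumed hooks into whatever streaming stylizer is under test (not the paper's API), and the widely used `lpips` package supplies the perceptual frame-to-frame distance. A score curve that climbs as the frame count grows would be the signature of drift.

```python
# Hypothetical long-horizon drift probe: alternate guidance every few seconds
# and log frame-to-frame LPIPS over the whole run.
import torch
import lpips  # pip install lpips

def probe_drift(stylizer, guidances, frames, fps: int = 24, switch_every_s: int = 5):
    """`stylizer` exposes set_guidance(g) and stylize_next_frame(x) -> stylized frame;
    both interfaces are assumptions for this sketch. Frames: (1, 3, H, W) in [-1, 1]."""
    metric = lpips.LPIPS(net="alex")
    period = fps * switch_every_s
    scores, prev = [], None
    for t, frame in enumerate(frames):
        if t % period == 0:                       # alternate prompts / references
            stylizer.set_guidance(guidances[(t // period) % len(guidances)])
        out = stylizer.stylize_next_frame(frame)
        if prev is not None:
            with torch.no_grad():
                scores.append(metric(prev, out).item())
        prev = out
    return scores                                 # plot vs. t; growth signals drift
```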

Figures

Figures reproduced from arXiv: 2604.13509 by Chen Liang, Hanwang Zhang, Hengye Lyu, Jiaxin Shi, Yue Hong, Yueting Weng, Zisu Li.

Figure 1: RTR-DiT is a DiT-based real-time video stylization framework that …
Figure 2: Overview of RTR-DiT. During training, the original video, the target …
Figure 3: Post-training pipeline. We distill the bidirectional teacher model into …
Figure 4: Visual comparison with TV2V stylization methods. Our method gener…
Figure 5: Visual comparison with RV2V stylization methods. Our method achieves …
Figure 6: Qualitative comparison results of ablation studies.
Figure 7: Long video stylization example. RTR-DiT maintains stable stylization …
Figure 8: Text/reference-guided switching example. Our model can switch the guid…
Original abstract

Recent advances in video generation models has significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a steaming video stylization framework built upon Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via post-training with Self Forcing and Distribution Matching Distillation. Furthermore, we propose a reference-preserving KV cache update strategy that not only enables stable and consistent processing of long videos, but also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided video stylization tasks, in terms of quantitative metrics and visual quality, and demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents RTR-DiT, a streaming video stylization framework based on Diffusion Transformers. It fine-tunes a bidirectional teacher DiT on a curated dataset supporting both text-guided and reference-guided stylization, distills the teacher into a few-step autoregressive student model via Self-Forcing and Distribution Matching Distillation, and introduces a reference-preserving KV cache update strategy to enable stable long-video processing and real-time switching between text prompts and reference images. The authors claim that RTR-DiT outperforms prior methods in quantitative metrics and visual quality while demonstrating practical real-time performance on long videos and interactive applications.

Significance. If the empirical claims hold, the work could meaningfully advance practical deployment of video stylization by addressing the speed and temporal-stability limitations of multi-step diffusion models. The combination of distillation for few-step inference with a cache-based mechanism for reference preservation offers a concrete path toward interactive, real-time generative video tools.

major comments (1)
  1. The central claim of stable real-time long-video stylization without artifacts or drift depends on the reference-preserving KV cache update strategy (combined with Self-Forcing + DMD distillation). The manuscript supplies no quantitative long-horizon evidence, such as frame-to-frame LPIPS curves, optical-flow consistency metrics, or style-leakage measurements on sequences exceeding training clip length. Autoregressive few-step diffusion is known to be sensitive to cache drift; without explicit tests or regularization over hundreds of frames, the load-bearing assumption remains unverified.
minor comments (1)
  1. Abstract: 'Recent advances in video generation models has significantly accelerated' contains a subject-verb agreement error ('has' should be 'have'). 'Steaming video stylization' is a typographical error for 'streaming video stylization'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the potential practical impact of RTR-DiT. The major comment correctly identifies that our stability claims for long-video stylization rest on the reference-preserving KV cache (in conjunction with Self-Forcing and DMD). We address this point directly below and commit to strengthening the manuscript with the requested quantitative evidence.

Point-by-point responses
  1. Referee: The central claim of stable real-time long-video stylization without artifacts or drift depends on the reference-preserving KV cache update strategy (combined with Self-Forcing + DMD distillation). The manuscript supplies no quantitative long-horizon evidence, such as frame-to-frame LPIPS curves, optical-flow consistency metrics, or style-leakage measurements on sequences exceeding training clip length. Autoregressive few-step diffusion is known to be sensitive to cache drift; without explicit tests or regularization over hundreds of frames, the load-bearing assumption remains unverified.

    Authors: We agree that the current manuscript does not provide the specific quantitative long-horizon metrics mentioned. Our evaluations to date have focused on qualitative consistency across long sequences and real-time interactive demonstrations, which show no visible drift or artifacts when using the reference-preserving KV cache. However, we recognize that these do not fully substitute for explicit measurements such as frame-to-frame LPIPS, optical-flow consistency, or style-leakage scores on sequences well beyond training clip length. In the revised manuscript we will add a dedicated long-horizon evaluation section reporting these metrics on videos of several hundred frames, directly testing the cache update strategy under the conditions the referee highlights. revision: yes
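For the optical-flow consistency measurement mentioned above, a common recipe is to warp stylized frame t toward frame t+1 using flow estimated on the source video and score the residual. The sketch below assumes precomputed backward flow (from frame t+1 to frame t, in pixel units, channels ordered dx, dy) from any off-the-shelf estimator, and omits occlusion masking for brevity.

```python
# Illustrative flow-warp consistency score (lower is better): stylization should
# move with the scene rather than flicker frame to frame.
import torch
import torch.nn.functional as F

def backward_warp(src: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Sample `src` (1, 3, H, W) at pixel locations displaced by `flow` (1, 2, H, W).
    With flow from frame t+1 back to frame t, this reconstructs frame t+1 from frame t."""
    _, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float()        # (H, W, 2), (x, y) order
    coords = base + flow[0].permute(1, 2, 0)            # displaced sampling positions
    coords[..., 0] = 2 * coords[..., 0] / (w - 1) - 1   # normalize x to [-1, 1]
    coords[..., 1] = 2 * coords[..., 1] / (h - 1) - 1   # normalize y to [-1, 1]
    return F.grid_sample(src, coords.unsqueeze(0), align_corners=True)

def warp_consistency(stylized_t, stylized_t1, flow_t1_to_t):
    """L1 residual between the flow-warped frame t and the actual frame t+1."""
    warped = backward_warp(stylized_t, flow_t1_to_t)
    return F.l1_loss(warped, stylized_t1).item()
```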

Circularity Check

0 steps flagged

No circularity: empirical training/distillation pipeline is self-contained

full rationale

The paper presents an engineering pipeline: fine-tune a bidirectional DiT teacher on a curated stylization dataset, distill via Self-Forcing + DMD into a few-step autoregressive model, and apply a reference-preserving KV cache update. These steps are described as training procedures and heuristic design choices validated by quantitative metrics and visual comparisons on held-out clips. No equations, uniqueness theorems, or ansatzes are introduced that reduce a claimed prediction or result back to the fitted inputs or prior self-citations by construction. The central claims rest on external experimental outcomes rather than internal definitional closure.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The approach rests on standard assumptions of diffusion model distillability and the effectiveness of KV caching for state preservation in autoregressive generation; no new physical entities or ad-hoc constants are introduced beyond typical ML hyperparameters.

free parameters (1)
  • number of denoising steps in student model
    Chosen as 'few-step' to balance speed and quality; value not specified in abstract but central to the real-time claim.
axioms (2)
  • domain assumption A bidirectional teacher diffusion model can be distilled into a stable autoregressive few-step student without significant quality loss for video stylization.
    Invoked in the post-training distillation step using Self Forcing and Distribution Matching Distillation.
  • domain assumption Reference information can be preserved across frames via KV cache updates without drift or artifacts in long sequences.
    Central to the reference-preserving KV cache update strategy for consistency.
invented entities (1)
  • RTR-DiT framework (no independent evidence)
    purpose: Real-time rerenderer for streaming video stylization
    The proposed end-to-end system combining teacher fine-tuning, distillation, and KV cache.

pith-pipeline@v0.9.0 · 5542 in / 1509 out tokens · 50587 ms · 2026-05-10T14:16:06.640468+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 20 canonical work pages · 13 internal anchors

  1. [1]

AI, K.: Next-generation AI creative studio (2026), https://www.pexels.com/, accessed: 2025-10-23

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  3. [3]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  4. [4]

Advances in Neural Information Processing Systems 37, 24081–24125 (2024)

Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 37, 24081–24125 (2024)

  5. [5]

Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922

Cong, Y., Xu, M., Simon, C., Chen, S., Ren, J., Xie, Y., Perez-Rua, J.M., Rosenhahn, B., Xiang, T., He, S.: Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922 (2023)

  6. [6]

    Tokenflow: Consistent diffusion features for consistent video editing

Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373 (2023)

  7. [7]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

  8. [8]

Advances in neural information processing systems 35, 8633–8646 (2022)

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems 35, 8633–8646 (2022)

  9. [9]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

  10. [10]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

  11. [11]

    arXiv preprint arXiv:2310.01107 (2023)

Jeong, H., Ye, J.C.: Ground-a-video: Zero-shot grounded video editing using text-to-image diffusion models. arXiv preprint arXiv:2310.01107 (2023)

  12. [12]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

  13. [13]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Kara, O., Kurtkaya, B., Yesiltepe, H., Rehg, J.M., Yanardag, P.: Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6507–6516 (2024)

  14. [14]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125 (2023)

  15. [15]

Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)

    Ku, M., Wei, C., Ren, W., Yang, H., Chen, W.: Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)

  16. [16]

    arXiv preprint arXiv:2601.02785 (2026)

Li, M., Chen, J., Zhao, S., Feng, W., Tu, P., He, Q.: Dreamstyle: A unified framework for video stylization. arXiv preprint arXiv:2601.02785 (2026)

  17. [17]

arXiv preprint arXiv:2501.08316 (2025)

Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., Jiang, L.: Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316 (2025)

  18. [18]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  19. [19]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)

  20. [20]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

  21. [21]

Pexels: Pexels (2025), https://www.pexels.com/, accessed: 2025-12-14

  22. [22]

    In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

  23. [23]

Journal of machine learning research 21(140), 1–67 (2020)

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)

  24. [24]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  25. [25]

Runway: Introducing runway gen-4 (2026), https://runwayml.com/research/introducing-runway-gen-4, accessed: 2026-01-20

  26. [26]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)

  27. [27]

    Measuring style similarity in diffusion models

Somepalli, G., Gupta, A., Gupta, K., Palta, S., Goldblum, M., Geiping, J., Shrivastava, A., Goldstein, T.: Measuring style similarity in diffusion models. arXiv preprint arXiv:2404.01292 (2024)

  28. [28]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  29. [29]

    Score-Based Generative Modeling through Stochastic Differential Equations

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

  30. [30]

    MAGI-1: Autoregressive Video Generation at Scale

    Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al.: Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211 (2025)

  31. [31]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

  32. [32]

    In: Proceedings of the IEEE/CVF international conference on computer vision

Wu, J.Z., Ge, Y., Wang, X., Lei, S.W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., Shou, M.Z.: Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7623–7633 (2023)

  33. [33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wu, Y., Chen, L., Li, R., Wang, S., Xie, C., Zhang, L.: Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16692–16701 (2025)

  34. [34]

    In: SIGGRAPH Asia 2023 Conference Papers

    Yang, S., Zhou, Y., Liu, Z., Loy, C.C.: Rerender a video: Zero-shot text-guided video-to-video translation. In: SIGGRAPH Asia 2023 Conference Papers. pp. 1–11 (2023)

  35. [35]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Yang, S., Zhou, Y., Liu, Z., Loy, C.C.: Fresco: Spatial-temporal correspondence for zero-shot video translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8703–8712 (2024)

  36. [36]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  37. [37]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Ye, Z., Huang, H., Wang, X., Wan, P., Zhang, D., Luo, W.: Stylemaster: Stylize your video with artistic generation and translation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2630–2640 (2025)

  38. [38]

Advances in neural information processing systems 37, 47455–47487 (2024)

Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems 37, 47455–47487 (2024)

  39. [39]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6613–6623 (2024)

  40. [40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22963–22974 (2025)

  41. [41]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3836–3847 (2023)

  42. [42]

    IEEE Transactions on Visualization and Computer Graphics (2025)

    Zhu, H., Xu, Y., Yu, J., He, S.: Zero-shot video translation via token warping. IEEE Transactions on Visualization and Computer Graphics (2025)