DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
Pith reviewed 2026-05-10 14:16 UTC · model grok-4.3
The pith
RTR-DiT distills a bidirectional Diffusion Transformer into a few-step autoregressive model that stylizes streaming video in real time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A bidirectional teacher Diffusion Transformer fine-tuned for video stylization can be distilled into a few-step autoregressive model; when equipped with a reference-preserving KV cache update strategy, the resulting system outperforms prior methods on quantitative metrics and visual quality for both text-guided and reference-guided stylization while delivering stable real-time performance on arbitrarily long videos and supporting interactive style switches without drift.
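To make the mechanics concrete, here is a minimal sketch of what few-step autoregressive generation looks like: each latent frame block is denoised in a handful of steps while attending to a cache of previously generated blocks. `StudentDiT`, its forward signature, and the four-level noise schedule are illustrative placeholders, not the paper's actual architecture or API.

```python
# Sketch of a few-step autoregressive rollout: one latent frame block at a time,
# each denoised in a few steps while attending to cached context from earlier blocks.
import torch
import torch.nn as nn


class StudentDiT(nn.Module):
    """Stand-in for the distilled few-step student; a single linear layer here."""

    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_block, t, cond, kv_cache):
        # A real DiT would cross-attend to `cond` (text / reference embeddings)
        # and self-attend over `kv_cache` (tokens of past frame blocks); here we
        # simply return a denoised estimate of the same shape.
        return self.proj(noisy_block)


@torch.no_grad()
def rollout(model, cond, num_blocks=8, block_len=4, latent_dim=16,
            noise_levels=(1.0, 0.75, 0.5, 0.25)):
    """Generate latent frame blocks autoregressively with few-step denoising."""
    kv_cache = []        # grows with each generated block (trimmed / managed in practice)
    video_latents = []
    for _ in range(num_blocks):
        x = torch.randn(block_len, latent_dim)            # each block starts from noise
        for i, t in enumerate(noise_levels):              # few (here 4) denoising steps
            x0_hat = model(x, t, cond, kv_cache)          # predict the clean block
            if i + 1 < len(noise_levels):                 # re-noise to the next, lower level
                x = x0_hat + noise_levels[i + 1] * torch.randn_like(x0_hat)
        video_latents.append(x0_hat)
        kv_cache.append(x0_hat)                           # expose this block to future blocks
    return torch.cat(video_latents, dim=0)


frames = rollout(StudentDiT(), cond=torch.randn(1, 16))
print(frames.shape)  # torch.Size([32, 16]): num_blocks * block_len latent frames
```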
What carries the argument
The reference-preserving KV cache update strategy, which maintains temporal coherence in the autoregressive DiT while allowing real-time changes between text and reference guidance signals.
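The abstract does not spell out the cache layout, but one plausible reading of "reference-preserving" is a rolling cache in which tokens derived from the reference image or text prompt stay pinned while old frame tokens are evicted, so a guidance switch replaces only the pinned slot instead of flushing the temporal context. A minimal sketch under that assumption:

```python
# Sketch of a reference-preserving KV cache: reference/prompt tokens stay pinned,
# frame tokens roll in a fixed-size window, and a style switch swaps only the
# pinned slot. Token payloads are plain strings here purely for illustration.
from collections import deque


class ReferencePreservingKVCache:
    def __init__(self, max_frame_blocks: int = 6):
        self.reference_kv = None                          # pinned: never evicted by the window
        self.frame_kv = deque(maxlen=max_frame_blocks)    # oldest frame block drops automatically

    def set_reference(self, reference_kv):
        """Interactive switch between text prompts / reference images:
        replace only the pinned slot, keep the recent frame context intact."""
        self.reference_kv = reference_kv

    def append_frame_block(self, block_kv):
        self.frame_kv.append(block_kv)

    def context(self):
        """Keys/values the next block attends to: pinned reference first, then recent frames."""
        pinned = [self.reference_kv] if self.reference_kv is not None else []
        return pinned + list(self.frame_kv)


cache = ReferencePreservingKVCache(max_frame_blocks=3)
cache.set_reference("kv(style: ukiyo-e reference image)")
for i in range(5):
    cache.append_frame_block(f"kv(frames {4 * i}-{4 * i + 3})")
cache.set_reference("kv(prompt: 'watercolor city at dusk')")  # mid-stream style switch
print(cache.context())  # reference slot swapped; the 3 most recent frame blocks survive
```

Because the pinned slot and the rolling window are independent, a mid-stream style switch leaves the most recent frame context untouched, which is exactly the property identified here as load-bearing.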
If this is right
- Quantitative metrics and visual quality exceed those of existing text-guided and reference-guided video stylization methods.
- Long video sequences can be processed stably at real-time speeds.
- Text prompts and reference images can be switched interactively during generation without introducing artifacts.
- The framework becomes suitable for immersive applications and artistic creation that require extended or live video output.
Where Pith is reading between the lines
- Similar distillation and cache strategies could be tested on other autoregressive video generation tasks to reduce latency.
- The approach may enable live video editing pipelines where style changes occur on the fly.
- Extending the KV cache preservation to additional conditioning signals could support more complex multi-modal video control.
Load-bearing premise
That the reference-preserving KV cache update strategy combined with the distillation process maintains temporal consistency and visual quality over arbitrarily long videos without introducing artifacts or drift when switching between text and reference guidance.
What would settle it
Stylize a multi-minute video sequence that alternates between text prompts and reference images every few seconds, then inspect whether visual quality or frame-to-frame consistency degrades after thousands of frames.
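A sketch of how that probe could be scripted: alternate text and reference guidance on a fixed schedule and log a frame-to-frame perceptual distance, so drift shows up as a rising trend rather than a subjective impression. `stylize_next_frame` and `perceptual_distance` are hypothetical stand-ins for the system under test and for an LPIPS-like metric.

```python
# Sketch of the drift probe: alternate guidance every few seconds and track
# frame-to-frame perceptual distance over thousands of frames.
import random


def stylize_next_frame(guidance):
    """Hypothetical stand-in for the streaming stylizer: returns one stylized frame."""
    return [random.random() for _ in range(8)]            # toy 8-dimensional "frame"


def perceptual_distance(a, b):
    """Stand-in for LPIPS: mean absolute difference on the toy frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def drift_probe(total_frames=3000, fps=24, switch_every_s=5):
    guidance_cycle = ["text: 'oil painting'", "reference: image_A", "text: 'pixel art'"]
    distances, prev = [], None
    for i in range(total_frames):
        guidance = guidance_cycle[(i // (fps * switch_every_s)) % len(guidance_cycle)]
        frame = stylize_next_frame(guidance)
        if prev is not None:
            distances.append(perceptual_distance(prev, frame))
        prev = frame
    # Compare early vs. late windows: stable stylization should keep these close;
    # a rising late-window mean is the drift signature in question.
    early = sum(distances[:200]) / 200
    late = sum(distances[-200:]) / 200
    return early, late


print(drift_probe(total_frames=1200))
```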
Original abstract
Recent advances in video generation models has significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a steaming video stylization framework built upon Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via post-training with Self Forcing and Distribution Matching Distillation. Furthermore, we propose a reference-preserving KV cache update strategy that not only enables stable and consistent processing of long videos, but also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided video stylization tasks, in terms of quantitative metrics and visual quality, and demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents RTR-DiT, a streaming video stylization framework based on Diffusion Transformers. It fine-tunes a bidirectional teacher DiT on a curated dataset supporting both text-guided and reference-guided stylization, distills the teacher into a few-step autoregressive student model via Self-Forcing and Distribution Matching Distillation, and introduces a reference-preserving KV cache update strategy to enable stable long-video processing and real-time switching between text prompts and reference images. The authors claim that RTR-DiT outperforms prior methods in quantitative metrics and visual quality while demonstrating practical real-time performance on long videos and interactive applications.
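For orientation, the second distillation ingredient, Distribution Matching Distillation, updates the student generator along the difference between a frozen "real" score (the teacher) and a "fake" score fitted to the student's own samples. The toy loop below follows that recipe in spirit only; the paper's actual losses, noise schedules, and Self-Forcing rollout are not reproduced here.

```python
# Toy DMD-style update: push the student so that the teacher ("real") denoiser
# and a "fake" denoiser trained on student samples agree on its outputs.
# Networks and shapes are minimal stand-ins, not the paper's architecture.
import torch
import torch.nn as nn

dim = 16
generator = nn.Linear(dim, dim)      # stand-in for the few-step student (one "step" here)
real_score = nn.Linear(dim, dim)     # stand-in for the frozen teacher's denoiser
fake_score = nn.Linear(dim, dim)     # denoiser fitted to the student's own outputs
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_f = torch.optim.Adam(fake_score.parameters(), lr=1e-4)

for step in range(100):
    z = torch.randn(8, dim)
    x_fake = generator(z)                                  # student sample from noise

    # (1) DMD-style generator gradient: difference of the two denoisers,
    #     evaluated on a noised student sample, applied to that sample.
    x_noisy = x_fake + 0.5 * torch.randn_like(x_fake)
    with torch.no_grad():
        grad = fake_score(x_noisy) - real_score(x_noisy)
    loss_g = (grad * x_fake).sum() / x_fake.numel()        # surrogate: its gradient w.r.t. x_fake is `grad` (up to scale)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # (2) Keep the fake denoiser tracking current student outputs
    #     with a simple reconstruction objective.
    x_det = x_fake.detach()
    pred = fake_score(x_det + 0.5 * torch.randn_like(x_det))
    loss_f = ((pred - x_det) ** 2).mean()
    opt_f.zero_grad()
    loss_f.backward()
    opt_f.step()
```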
Significance. If the empirical claims hold, the work could meaningfully advance practical deployment of video stylization by addressing the speed and temporal-stability limitations of multi-step diffusion models. The combination of distillation for few-step inference with a cache-based mechanism for reference preservation offers a concrete path toward interactive, real-time generative video tools.
major comments (1)
- The central claim of stable real-time long-video stylization without artifacts or drift depends on the reference-preserving KV cache update strategy (combined with Self-Forcing + DMD distillation). The manuscript supplies no quantitative long-horizon evidence, such as frame-to-frame LPIPS curves, optical-flow consistency metrics, or style-leakage measurements on sequences exceeding training clip length. Autoregressive few-step diffusion is known to be sensitive to cache drift; without explicit tests or regularization over hundreds of frames, the load-bearing assumption remains unverified.
minor comments (1)
- Abstract: 'Recent advances in video generation models has significantly accelerated' contains a subject-verb agreement error ('has' should be 'have'). 'Steaming video stylization' is a typographical error for 'streaming video stylization'.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for recognizing the potential practical impact of RTR-DiT. The major comment correctly identifies that our stability claims for long-video stylization rest on the reference-preserving KV cache (in conjunction with Self-Forcing and DMD). We address this point directly below and commit to strengthening the manuscript with the requested quantitative evidence.
Point-by-point responses
Referee: The central claim of stable real-time long-video stylization without artifacts or drift depends on the reference-preserving KV cache update strategy (combined with Self-Forcing + DMD distillation). The manuscript supplies no quantitative long-horizon evidence, such as frame-to-frame LPIPS curves, optical-flow consistency metrics, or style-leakage measurements on sequences exceeding training clip length. Autoregressive few-step diffusion is known to be sensitive to cache drift; without explicit tests or regularization over hundreds of frames, the load-bearing assumption remains unverified.
Authors: We agree that the current manuscript does not provide the specific quantitative long-horizon metrics mentioned. Our evaluations to date have focused on qualitative consistency across long sequences and real-time interactive demonstrations, which show no visible drift or artifacts when using the reference-preserving KV cache. However, we recognize that these do not fully substitute for explicit measurements such as frame-to-frame LPIPS, optical-flow consistency, or style-leakage scores on sequences well beyond training clip length. In the revised manuscript we will add a dedicated long-horizon evaluation section reporting these metrics on videos of several hundred frames, directly testing the cache update strategy under the conditions the referee highlights.
Revision: yes
Circularity Check
No circularity: empirical training/distillation pipeline is self-contained
Full rationale
The paper presents an engineering pipeline: fine-tune a bidirectional DiT teacher on a curated stylization dataset, distill via Self-Forcing + DMD into a few-step autoregressive model, and apply a reference-preserving KV cache update. These steps are described as training procedures and heuristic design choices validated by quantitative metrics and visual comparisons on held-out clips. No equations, uniqueness theorems, or ansatzes are introduced that reduce a claimed prediction or result back to the fitted inputs or prior self-citations by construction. The central claims rest on external experimental outcomes rather than internal definitional closure.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of denoising steps in student model
axioms (2)
- Domain assumption: A bidirectional teacher diffusion model can be distilled into a stable autoregressive few-step student without significant quality loss for video stylization.
- Domain assumption: Reference information can be preserved across frames via KV cache updates without drift or artifacts in long sequences.
invented entities (1)
- RTR-DiT framework (no independent evidence)
Reference graph
Works this paper leans on
- [1] AI, K.: Next-generation ai creative studio (2026), https://www.pexels.com/, accessed: 2025-10-23
- [2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
- [3] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
- [4] Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 37, 24081–24125 (2024)
- [5] Cong, Y., Xu, M., Simon, C., Chen, S., Ren, J., Xie, Y., Perez-Rua, J.M., Rosenhahn, B., Xiang, T., He, S.: Flatten: Optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922 (2023)
- [6] Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373 (2023)
- [7] Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)
- [8] Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in Neural Information Processing Systems 35, 8633–8646 (2022)
- [9] Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)
- [10] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)
- [11] Jeong, H., Ye, J.C.: Ground-a-video: Zero-shot grounded video editing using text-to-image diffusion models. arXiv preprint arXiv:2310.01107 (2023)
- [12] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)
- [13] Kara, O., Kurtkaya, B., Yesiltepe, H., Rehg, J.M., Yanardag, P.: Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6507–6516 (2024)
- [14] Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125 (2023)
- [15] Ku, M., Wei, C., Ren, W., Yang, H., Chen, W.: Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)
- [16] Li, M., Chen, J., Zhao, S., Feng, W., Tu, P., He, Q.: Dreamstyle: A unified framework for video stylization. arXiv preprint arXiv:2601.02785 (2026)
- [17] Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., Jiang, L.: Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316 (2025)
- [18] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)
- [19] Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
- [20] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)
- [21] Pexels: Pexels (2025), https://www.pexels.com/, accessed: 2025-12-14
- [22] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
- [23] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), 1–67 (2020)
- [24] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
- [25] Runway: Introducing runway gen-4 (2026), https://runwayml.com/research/introducing-runway-gen-4, accessed: 2026-01-20
- [26] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
- [27] Somepalli, G., Gupta, A., Gupta, K., Palta, S., Goldblum, M., Geiping, J., Shrivastava, A., Goldstein, T.: Measuring style similarity in diffusion models. arXiv preprint arXiv:2404.01292 (2024)
- [28] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
- [29] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
- [30] Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al.: Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211 (2025)
- [31] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint (2025)
- [32] Wu, J.Z., Ge, Y., Wang, X., Lei, S.W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., Shou, M.Z.: Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7623–7633 (2023)
- [33] Wu, Y., Chen, L., Li, R., Wang, S., Xie, C., Zhang, L.: Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16692–16701 (2025)
- [34] Yang, S., Zhou, Y., Liu, Z., Loy, C.C.: Rerender a video: Zero-shot text-guided video-to-video translation. In: SIGGRAPH Asia 2023 Conference Papers. pp. 1–11 (2023)
- [35] Yang, S., Zhou, Y., Liu, Z., Loy, C.C.: Fresco: Spatial-temporal correspondence for zero-shot video translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8703–8712 (2024)
- [36] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
- [37] Ye, Z., Huang, H., Wang, X., Wan, P., Zhang, D., Luo, W.: Stylemaster: Stylize your video with artistic generation and translation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2630–2640 (2025)
- [38] Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems 37, 47455–47487 (2024)
- [39] Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6613–6623 (2024)
- [40] Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22963–22974 (2025)
- [41] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
- [42] Zhu, H., Xu, Y., Yu, J., He, S.: Zero-shot video translation via token warping. IEEE Transactions on Visualization and Computer Graphics (2025)