SteerVTE: Seamless Video Text Editing with Style and Glyph Control

Kai Zeng; Ming Lu; Moran Li; Qi She; Ruichuan An; Wentao Zhang; Yiheng Lin; Yingchen Yu; Zhengwei Wang

arxiv: 2606.23254 · v1 · pith:AAXCKVCRnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-26 08:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video text editing · diffusion transformer · glyph control · style consistency · temporal coherence · adapter modules · synthetic dataset · progressive training

The pith

SteerVTE steers a frozen video diffusion model to edit text precisely via style and glyph control without base model retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to change text inside video frames while keeping the original visual style and smooth motion across time. It freezes an existing video diffusion transformer and adds a small adapter that reads the old text's appearance and encodes the new text at both line and single-character scales. A focused loss term and a training schedule that begins with still images before moving to video clips help the system overcome the base model's limited ability to draw sharp text. A new dataset of one million synthetic examples supports training at scale. Experiments show gains over prior video editing approaches on measures of text legibility, style match, and frame-to-frame stability.

Core claim

SteerVTE attaches a lightweight text context adapter—containing a style encoder for original visual attributes and dual-granularity glyph encoders for target text at line and character levels—to a frozen diffusion transformer; a glyph-aware spatial-focal loss and three-stage image-to-video curriculum then enable precise stroke-level text replacement while preserving stylistic fidelity and temporal coherence.
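The idea behind a glyph-aware spatial loss can be illustrated with a region-weighted reconstruction error. This is only a sketch under stated assumptions: this page does not give the paper's formulation, so the function name `glyph_weighted_mse`, the binary `glyph_mask`, and the `glyph_weight` value are all hypothetical, and a plain weighted mean squared error stands in for whatever focal weighting the authors actually use.

```python
from typing import List

Grid = List[List[float]]


def glyph_weighted_mse(
    pred: Grid, target: Grid, glyph_mask: Grid, glyph_weight: float = 5.0
) -> float:
    """Mean squared error that counts glyph pixels more than background.

    Assumption (not from the paper): a pixel gets weight 1 outside the
    glyph mask and `glyph_weight` inside it; the result is normalised by
    the total weight so it stays on the same scale as a plain MSE.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, glyph_mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            w = 1.0 + (glyph_weight - 1.0) * m  # m is 1.0 on strokes, else 0.0
            weighted_sum += w * (p - t) ** 2
            weight_total += w
    return weighted_sum / weight_total
```

With a weighting of this kind, an error of the same size costs more when it falls on a stroke than when it falls on background, which is one plausible way to push a model with weak text priors toward legible glyphs in small regions.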

What carries the argument

Lightweight text context adapter (style encoder plus dual-granularity glyph encoders) plus glyph-aware spatial-focal loss on a frozen diffusion transformer.
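A toy sketch can make the "frozen base plus lightweight adapter" pattern concrete. It assumes a zero-initialised gate, a common choice in adapter and ControlNet-style designs; the page does not say whether SteerVTE does this, and `frozen_block`, `GatedAdapter`, and `adapted_block` are hypothetical names used only for illustration.

```python
from typing import List

Vector = List[float]


def frozen_block(hidden: Vector) -> Vector:
    """Stand-in for one block of the frozen base model; it has no trainable state."""
    return [2.0 * h + 1.0 for h in hidden]


class GatedAdapter:
    """Hypothetical adapter: adds a gated conditioning signal to the block output."""

    def __init__(self, dim: int) -> None:
        # The gate is the only trainable parameter in this sketch.
        # Zero initialisation (an assumption) leaves the base output unchanged at first.
        self.gate: Vector = [0.0] * dim

    def __call__(self, hidden: Vector, condition: Vector) -> Vector:
        return [h + g * c for h, g, c in zip(hidden, self.gate, condition)]


def adapted_block(hidden: Vector, condition: Vector, adapter: GatedAdapter) -> Vector:
    """Run the frozen block, then let the adapter inject style or glyph features."""
    return adapter(frozen_block(hidden), condition)
```

Before any training, `adapted_block` returns exactly what `frozen_block` returns, so the adapter can only add to the base model's behaviour as its gate moves away from zero.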

If this is right

Text edits remain accurate at the stroke level inside small regions across multiple frames.
Style attributes of the original text are transferred without retraining the underlying video model.
Temporal coherence improves relative to baselines that lack glyph-level guidance.
Training scales efficiently from image data to full video sequences using the one-million-triplet dataset.
The same adapter design supports both style preservation and content replacement in one forward pass.
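A staged image-to-video curriculum of the kind described above could be expressed as a simple step-indexed schedule. The sketch below is illustrative only: the stage names, frame counts, and step budgets are hypothetical placeholders, since this page states only that training has three stages and moves from images to video.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    data: str        # "image" or "video"
    num_frames: int  # 1 for still images
    steps: int       # optimisation steps spent in this stage


# Hypothetical values; the paper's real stage definitions are not given here.
CURRICULUM: List[Stage] = [
    Stage("stage1_images", "image", 1, 1000),
    Stage("stage2_short_clips", "video", 9, 1000),
    Stage("stage3_full_clips", "video", 33, 1000),
]


def stage_for_step(step: int, curriculum: List[Stage] = CURRICULUM) -> Stage:
    """Return the stage that a global training step falls into."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for stage in curriculum:
        boundary += stage.steps
        if step < boundary:
            return stage
    return curriculum[-1]  # stay in the final stage once the budget is used up
```

A training loop would call `stage_for_step(global_step)` each iteration to decide whether to draw an image batch or a video batch.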

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on user-provided real-world video rather than only synthetic data to check generalization.
Similar adapters might transfer to other localized editing tasks such as object insertion or color grading.
If the glyph encoders prove robust, they could reduce reliance on large-scale synthetic data in future video models.
Extending the three-stage curriculum to include audio-synchronized text might address subtitle editing scenarios.

Load-bearing premise

Lightweight adapters and a glyph-focused loss can overcome the weak text-drawing ability of frozen video models without introducing visible artifacts in small regions.

What would settle it

A controlled test on video clips containing small text would settle it: if editing produced no measurable drop in rendering errors relative to baselines, or increased temporal flicker, the claim that the adapters and loss suffice would be falsified.
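The two quantities such a test would track can be sketched as follows. These are generic stand-ins, not the paper's metrics: `char_error_rate` assumes an OCR system has already produced the recognised string, and `temporal_flicker` assumes the text region has been cropped and stabilised so that frame-to-frame change reflects rendering instability rather than camera motion.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def char_error_rate(recognized: str, target: str) -> float:
    """Edit distance normalised by target length; 0.0 means a perfect read."""
    if not target:
        raise ValueError("target text must be non-empty")
    return edit_distance(recognized, target) / len(target)


def temporal_flicker(frames: Sequence[List[float]]) -> float:
    """Mean absolute intensity change between consecutive text-region crops.

    Each frame is a flat list of pixel intensities of equal length.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        for p, c in zip(prev, curr):
            total += abs(c - p)
            count += 1
    return total / count
```

Comparing both numbers before and after editing, on clips binned by text size, would show whether small text is where the frozen-base design breaks down.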

Original abstract

Visual text editing aims to precisely modify text in images and videos while preserving stylistic consistency and visual realism. Despite significant advances in the image domain, video text editing remains largely unexplored: it is a localized task demanding stroke-level precision within small text regions, which compounds the challenges of cross-frame accuracy, temporal coherence, and stylistic fidelity. We introduce SteerVTE, a unified framework that steers a frozen video diffusion model to perform precise Video Text Editing through style and glyph control. Built on a frozen diffusion transformer, SteerVTE attaches a lightweight text context adapter with two complementary modules: a style encoder capturing the original text's visual attributes, and dual-granularity glyph encoders encoding the target text at both the line and character levels. To overcome the inherently weak text rendering priors of video foundation models, we further propose a glyph-aware spatial-focal loss and a three-stage progressive training curriculum that scales from image to video data. To support large-scale training, we also develop an automatic synthesis pipeline and construct SteerVTE-1M, a dataset of one million triplets spanning diverse scenes, fonts, and stylistic effects. Extensive experiments demonstrate that SteerVTE substantially outperforms existing video editing baselines across text accuracy, style consistency, and temporal coherence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SteerVTE adds style and dual-granularity glyph adapters plus a 1M synthetic dataset to steer a frozen video diffusion model for text editing, but the abstract gives no numbers to back the outperformance claim.

The main takeaway is that this paper targets video text editing, an area that has stayed largely unexplored while image text editing advanced. It freezes a diffusion transformer and adds a text context adapter with a style encoder plus dual-granularity glyph encoders at line and character levels. The authors also introduce a glyph-aware spatial-focal loss and a three-stage curriculum that trains first on images, then on video. To make this feasible at scale, they built an automatic synthesis pipeline and constructed SteerVTE-1M, a million-triplet dataset covering diverse scenes, fonts, and stylistic effects.

Those pieces are the concrete additions. The curriculum and the dual encoders look like reasonable ways to handle stroke-level changes and temporal consistency without touching the base model. The dataset fills a practical gap for training on this task.

The weak point is the evidence. The abstract states substantial gains on text accuracy, style consistency, and temporal coherence, yet supplies no metrics, splits, ablations, or error bars. That makes it impossible to judge whether the adapters actually deliver stroke precision in small regions or whether artifacts appear when the base priors are weak. The decision to avoid any base-model retraining puts the full burden on the lightweight modules; the stress-test concern about small text holds until the experiments are shown.

This work is for people building video generation tools or working on localized editing in media pipelines. A reader already following diffusion-based editing would find the components and the new dataset useful to examine.

It deserves a serious referee. The task is new, the modules are specified, and the dataset is a real contribution even if the performance numbers need verification. Send it for review.

Referee Report

2 major / 0 minor

Summary. The paper introduces SteerVTE, a unified framework for video text editing that steers a frozen video diffusion transformer via a lightweight text context adapter (style encoder plus dual-granularity glyph encoders at line and character levels), a glyph-aware spatial-focal loss, and a three-stage image-to-video curriculum. It also contributes an automatic synthesis pipeline and the SteerVTE-1M dataset of one million triplets, claiming substantial outperformance over video editing baselines on text accuracy, style consistency, and temporal coherence.

Significance. If the empirical claims hold with rigorous validation, the work would be significant for addressing an underexplored task (stroke-level text editing in video) without base-model retraining. The large-scale dataset and adapter-based control mechanism could enable practical applications in video post-production.

major comments (2)

[Abstract] The central claim of substantial outperformance across text accuracy, style consistency, and temporal coherence supplies no quantitative numbers, error bars, dataset splits, ablation details, or statistical tests, which is load-bearing for assessing whether the adapters and loss actually compensate for the acknowledged weak text-rendering priors of the frozen base model.
[Abstract] The assumption that lightweight adapters plus glyph-aware loss suffice for stroke-level precision in small text regions (without visible artifacts or coherence failures) is load-bearing for the no-retraining design; this requires explicit testing on challenging cases (tiny fonts, complex styles) that the abstract acknowledges as the core difficulty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment below and will incorporate revisions to strengthen the abstract and experimental validation as outlined.

Referee: [Abstract] The central claim of substantial outperformance across text accuracy, style consistency, and temporal coherence supplies no quantitative numbers, error bars, dataset splits, ablation details, or statistical tests, which is load-bearing for assessing whether the adapters and loss actually compensate for the acknowledged weak text-rendering priors of the frozen base model.

Authors: We agree that the abstract would benefit from including key quantitative results. In the revised manuscript, we will update the abstract to report specific metrics (e.g., text accuracy gains of X%, style consistency scores, and temporal coherence improvements from Tables 1-3) with references to the full experimental details, error bars, and dataset information already present in Sections 4 and 5. This will make the central claims more self-contained while preserving brevity.

revision: yes

Referee: [Abstract] The assumption that lightweight adapters plus glyph-aware loss suffice for stroke-level precision in small text regions (without visible artifacts or coherence failures) is load-bearing for the no-retraining design; this requires explicit testing on challenging cases (tiny fonts, complex styles) that the abstract acknowledges as the core difficulty.

Authors: The SteerVTE-1M dataset and our experiments already encompass diverse challenging cases including tiny fonts, complex styles, and small text regions, as described in the dataset construction and evaluation protocols. The dual-granularity glyph encoders and glyph-aware loss are specifically motivated to handle stroke-level precision. To directly address the concern, we will add a targeted analysis subsection with quantitative and qualitative results on these edge cases, confirming the absence of visible artifacts or coherence failures under the frozen-base design.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experiments, not definitional reductions

The paper introduces SteerVTE as an empirical framework attaching lightweight adapters and a glyph-aware loss to a frozen diffusion model, with performance claims supported by experiments on a synthesized dataset. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content that would reduce any result to its inputs by construction. The central claims of outperformance are presented as experimental outcomes rather than derived identities, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented physical entities are stated. The work relies on the standard assumption that frozen diffusion transformers can be steered by small adapters and that synthetic data can substitute for real annotated video text examples.

Reference graph

Works this paper leans on

69 extracted references · 17 linked inside Pith

[1]

Tongyi wan 2.7 video generation, 2026

Alibaba Cloud. Tongyi wan 2.7 video generation, 2026. https://tongyi.aliyun.com/wan/generate/video/ generate?model=wan2.7

2026
[2]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[3]

Videopainter: Any-length video inpainting and editing with plug-and-play context control

Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, and Qiang Xu. Videopainter: Any-length video inpainting and editing with plug-and-play context control. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–12, 2025

2025
[4]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

2023
[5]

Diffute: Universal text editing diffusion model.Advances in Neural Information Processing Systems, 36:63062–63074, 2023

Haoxing Chen, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Changhua Meng, Huijia Zhu, Weiqiang Wang, et al. Diffute: Universal text editing diffusion model.Advances in Neural Information Processing Systems, 36:63062–63074, 2023

2023
[6]

Textdiffuser: Diffusion models as text painters.Advances in Neural Information Processing Systems, 36:9353–9387, 2023

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters.Advances in Neural Information Processing Systems, 36:9353–9387, 2023

2023
[7]

Textdiffuser-2: Unleashing the power of language models for text rendering

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. InEuropean Conference on Computer Vision, pages 386–402. Springer, 2024

2024
[8]

Consistent video-to-video transfer using synthetic dataset.arXiv preprint arXiv:2311.00213, 2023

Jiaxin Cheng, Tianjun Xiao, and Tong He. Consistent video-to-video transfer using synthetic dataset.arXiv preprint arXiv:2311.00213, 2023

arXiv 2023
[9]

Viva: Vlm-guided instruction-based video editing with reward optimization.arXiv preprint arXiv:2512.16906, 2025

Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, and Chongyang Ma. Viva: Vlm-guided instruction-based video editing with reward optimization.arXiv preprint arXiv:2512.16906, 2025

arXiv 2025
[10]

Flatten: optical flow-guided attention for consistent text-to-video editing

Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023

arXiv 2023
[11]

Peak signal-to-noise ratio, 2026.https://en.wikipedia.org/w/index.php?title=Peak_ signal-to-noise_ratio&oldid=1210897995

Wikipedia contributors. Peak signal-to-noise ratio, 2026.https://en.wikipedia.org/w/index.php?title=Peak_ signal-to-noise_ratio&oldid=1210897995

2026
[12]

Paddleocr 3.0 technical report.arXiv preprint arXiv:2507.05595, 2025

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report.arXiv preprint arXiv:2507.05595, 2025

Pith/arXiv arXiv 2025
[13]

Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

2021
[14]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

2024
[15]

Videoshop: Localized semantic video editing with noise-extrapolated diffusion inversion.arXiv preprint arXiv:2403.14617, 2024

Xiang Fan, Anand Bhattad, and Ranjay Krishna. Videoshop: Localized semantic video editing with noise-extrapolated diffusion inversion.arXiv preprint arXiv:2403.14617, 2024

arXiv 2024
[16]

google-10000-english, 2026.https://github.com/first20hours/google-10000-english

first20hours. google-10000-english, 2026.https://github.com/first20hours/google-10000-english

2026
[17]

Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023

Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023

Pith/arXiv arXiv 2023
[18]

Google fonts, 2026.https://fonts.google.com/

Google. Google fonts, 2026.https://fonts.google.com/

2026
[19]

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. InProceedings of the 23rd international conference on Machine learning, pages 369–376, 2006

2006
[20]

Videoswap: Customized video subject swapping with interactive semantic point correspondence

Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, and Kevin Tang. Videoswap: Customized video subject swapping with interactive semantic point correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 11

2024
[21]

Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

Pith/arXiv arXiv 2024
[22]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

2017
[23]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020
[24]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022
[25]

Vivid-10m: A dataset and baseline for versatile and interactive video local editing.arXiv preprint arXiv:2411.15260, 2024

Jiahao Hu, Tianxiong Zhong, Xuebo Wang, Boyuan Jiang, Xingye Tian, Fei Yang, Pengfei Wan, and Di Zhang. Vivid-10m: A dataset and baseline for versatile and interactive video local editing.arXiv preprint arXiv:2411.15260, 2024

arXiv 2024
[26]

Vace: All-in-one video creation and editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025

2025
[27]

Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models

Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M Rehg, and Pinar Yanardag. Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[28]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[29]

Anyv2v: A plug-and-play framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

Max Ku, Cong Wei, Weiming Ren, Huan Yang, and Wenhu Chen. Anyv2v: A plug-and-play framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

arXiv 2024
[30]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

Pith/arXiv arXiv 2025
[31]

Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025

Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Lei Sun, and Xiangxiang Chu. Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025

arXiv 2025
[32]

Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

Pith/arXiv arXiv 2026
[33]

Vidtome: Video token merging for zero-shot video editing

Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. Vidtome: Video token merging for zero-shot video editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[34]

In-context learning with unpaired clips for instruction-based video editing.arXiv preprint arXiv:2510.14648, 2025

Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, and Guosheng Lin. In-context learning with unpaired clips for instruction-based video editing.arXiv preprint arXiv:2510.14648, 2025

arXiv 2025
[35]

Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024

Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024

Pith/arXiv arXiv 2024
[36]

Kiwi-edit: Versatile video editing via instruction and reference guidance.arXiv preprint arXiv:2603.02175, 2026

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. Kiwi-edit: Versatile video editing via instruction and reference guidance.arXiv preprint arXiv:2603.02175, 2026

Pith/arXiv arXiv 2026
[37]

Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[38]

Generative video propagation.arXiv preprint arXiv:2412.19761, 2024

Shaoteng Liu, Tianyu Wang, Jui-Hsien Wang, Qing Liu, Zhifei Zhang, Joon-Young Lee, Yijun Li, Bei Yu, Zhe Lin, Soo Ye Kim, et al. Generative video propagation.arXiv preprint arXiv:2412.19761, 2024

arXiv 2024
[39]

Glyph-byt5: A customized text encoder for accurate visual text rendering

Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan. Glyph-byt5: A customized text encoder for accurate visual text rendering. InEuropean Conference on Computer Vision, pages 361–377. Springer, 2024. 12

2024
[40]

Glyphdraw: Seamlessly rendering text with intricate spatial structures in text-to-image generation.arXiv preprint arXiv:2303.17870, 2023

Jian Ma, Mingjun Zhao, Chen Chen, Ruichen Wang, Di Niu, Haonan Lu, and Xiaodong Lin. Glyphdraw: Seamlessly rendering text with intricate spatial structures in text-to-image generation.arXiv preprint arXiv:2303.17870, 2023

arXiv 2023
[41]

Um-text: A unified multimodal model for image understanding.arXiv preprint arXiv:2601.08321, 2026

Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu, Yichun Liu, Yu Shi, Jason Li, and Junshi Huang. Um-text: A unified multimodal model for image understanding.arXiv preprint arXiv:2601.08321, 2026

Pith/arXiv arXiv 2026
[42]

Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Pith/arXiv arXiv 2023
[43]

Codef: Content deformation fields for temporally consistent video processing

Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, and Yujun Shen. Codef: Content deformation fields for temporally consistent video processing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[44]

I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models.arXiv preprint arXiv:2405.16537, 2024

Wenqi Ouyang, Yi Dong, Lei Yang, Jianlou Si, and Xingang Pan. I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models.arXiv preprint arXiv:2405.16537, 2024

arXiv 2024
[45]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[46]

Fatezero: Fusing attentions for zero-shot text-based video editing

Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2023
[47]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[48]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022
[49]

Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

Pith/arXiv arXiv 2026
[50]

Stellar: Scene text editor for low-resource languages and real-world data.arXiv preprint arXiv:2511.09977, 2025

Yongdeuk Seo, Hyun-seok Min, and Sungchul Choi. Stellar: Scene text editor for low-resource languages and real-world data.arXiv preprint arXiv:2511.09977, 2025

arXiv 2025
[51]

Fonts: Text rendering with typography and style controls

Wenda Shi, Yiren Song, Dengming Zhang, Jiaming Liu, and Xingxing Zou. Fonts: Text rendering with typography and style controls. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18463–18474, 2025

2025
[52]

Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

Pith/arXiv arXiv 2010
[53]

Dopi: Doctor-like proactive interrogation llm for traditional chinese medicine.arXiv preprint arXiv:2507.04877, 2025

Zewen Sun, Ruoxiang Huang, Jiahe Feng, Rundong Kong, Yuqian Wang, Hengyu Liu, Ziqi Gong, Yuyuan Qin, Yingxue Wang, and Yu Wang. Dopi: Doctor-like proactive interrogation llm for traditional chinese medicine.arXiv preprint arXiv:2507.04877, 2025

arXiv 2025
[54]

Omni-video: Democratizing unified video understanding and generation.arXiv preprint arXiv:2507.06119, 2025

Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, and Hao Li. Omni-video: Democratizing unified video understanding and generation.arXiv preprint arXiv:2507.06119, 2025

arXiv 2025
[55]

Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

arXiv 2023
[56]

Anytext2: Visual text generation and editing with customizable attributes.arXiv preprint arXiv:2411.15245, 2024

Yuxiang Tuo, Yifeng Geng, and Liefeng Bo. Anytext2: Visual text generation and editing with customizable attributes.arXiv preprint arXiv:2411.15245, 2024

arXiv 2024
[57]

Fvd: A new metric for video generation

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019

2019
[58]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[59]

Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 13

2004
[60]

Univideo: Unified understanding, generation, and editing for videos.arXiv preprint arXiv:2510.08377, 2025

Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. Univideo: Unified understanding, generation, and editing for videos.arXiv preprint arXiv:2510.08377, 2025

arXiv 2025
[61]

Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

Pith/arXiv arXiv 2025
[62]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

2023
[63]

A bilingual, openworld video text dataset and end-to-end video text spotter with transformer. arXiv preprint arXiv:2112.04888, 2021

Weijia Wu, Yuanqiang Cai, Debing Zhang, Sibo Wang, Zhuang Li, Jiahong Li, Yejun Tang, and Hong Zhou. A bilingual, openworld video text dataset and end-to-end video text spotter with transformer. arXiv preprint arXiv:2112.04888, 2021

arXiv 2021
[64]

Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

Pith/arXiv arXiv 2024
[65]

Textctrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems, 37:138569–138594, 2024

Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, and Yu Zhou. Textctrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems, 37:138569–138594, 2024

2024
[66]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023
[67]

Effived: Efficient video editing via text-instruction diffusion models. arXiv preprint arXiv:2403.11568, 2024

Zhenghao Zhang, Zuozhuo Dai, Long Qin, and Weizhi Wang. Effived: Efficient video editing via text-instruction diffusion models. arXiv preprint arXiv:2403.11568, 2024

arXiv 2024
[68]

Utdesign: A unified framework for stylized text editing and generation in graphic design images

Yiming Zhao, Yuanpeng Gao, Yuxuan Luo, Jiwei Duan, Shisong Lin, Longfei Xiong, and Zhouhui Lian. Utdesign: A unified framework for stylized text editing and generation in graphic design images. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

2025
[69]

Visual text generation in the wild

Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, and Zhibo Yang. Visual text generation in the wild. In European Conference on Computer Vision, pages 89–106. Springer, 2024

2024

Appendix A Outlines

The supplementary material presents the following sections to strengthen the main manuscript:

• Section B.1 presents...
