SteerVTE: Seamless Video Text Editing with Style and Glyph Control
Pith reviewed 2026-06-26 08:50 UTC · model grok-4.3
The pith
SteerVTE steers a frozen video diffusion model to edit text precisely via style and glyph control without base model retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SteerVTE attaches a lightweight text context adapter to a frozen diffusion transformer. The adapter contains a style encoder for the original text's visual attributes and dual-granularity glyph encoders for the target text at line and character levels. A glyph-aware spatial-focal loss and a three-stage image-to-video curriculum then enable precise stroke-level text replacement while preserving stylistic fidelity and temporal coherence.
What carries the argument
Lightweight text context adapter (style encoder plus dual-granularity glyph encoders) plus glyph-aware spatial-focal loss on a frozen diffusion transformer.
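This page does not reproduce the formula for the glyph-aware spatial-focal loss, so the following is only a minimal sketch of the general idea of concentrating the training signal on text strokes. It assumes a plain mask-weighted mean squared error as a stand-in, not the paper's actual loss. The function name `glyph_weighted_mse`, the `alpha` parameter, and the weighting rule `1 + alpha * mask` are all hypothetical.

```python
from typing import List

Grid = List[List[float]]


def glyph_weighted_mse(pred: Grid, target: Grid, glyph_mask: Grid,
                       alpha: float = 4.0) -> float:
    """Mean squared error with extra weight on glyph pixels.

    Hypothetical stand-in for a glyph-aware loss: each pixel gets weight
    1 + alpha * mask, and the sum is normalized by the total weight.
    """
    weighted_error = 0.0
    total_weight = 0.0
    for p_row, t_row, m_row in zip(pred, target, glyph_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = 1.0 + alpha * m
            weighted_error += w * (p - t) ** 2
            total_weight += w
    return weighted_error / total_weight


# A 2x2 toy frame whose bottom-right pixel belongs to a glyph.
target = [[0.0, 0.0], [0.0, 0.0]]
mask = [[0.0, 0.0], [0.0, 1.0]]

# The same unit error costs more inside the glyph region than outside it.
inside = glyph_weighted_mse([[0.0, 0.0], [0.0, 1.0]], target, mask)   # 0.625
outside = glyph_weighted_mse([[1.0, 0.0], [0.0, 0.0]], target, mask)  # 0.125
```

With `alpha = 0` the function reduces to an ordinary mean squared error, which makes the effect of the glyph weighting easy to isolate in an ablation.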
If this is right
- Text edits remain accurate at the stroke level inside small regions across multiple frames.
- Style attributes of the original text are transferred without retraining the underlying video model.
- Temporal coherence improves relative to baselines that lack glyph-level guidance.
- Training scales efficiently from image data to full video sequences using the one-million-triplet dataset.
- The same adapter design supports both style preservation and content replacement in one forward pass.
Where Pith is reading between the lines
- The approach could be tested on user-provided real-world video rather than only synthetic data to check generalization.
- Similar adapters might transfer to other localized editing tasks such as object insertion or color grading.
- If the glyph encoders prove robust, they could reduce reliance on large-scale synthetic data in future video models.
- Extending the three-stage curriculum to include audio-synchronized text might address subtitle editing scenarios.
Load-bearing premise
Lightweight adapters and a glyph-focused loss can overcome the weak text-drawing ability of frozen video models without introducing visible artifacts in small regions.
What would settle it
A controlled test on video clips containing small text would falsify the claim that the adapters and loss suffice if it showed either no measurable drop in rendering errors or an increase in temporal flicker after editing.
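The two quantities in this test can be made concrete with a short sketch. It assumes rendering errors are measured as a character error rate between OCR output and the target string, and flicker as the mean absolute change between consecutive frames inside the text region. Both choices are illustrative assumptions, and the function names are hypothetical; the paper's own metrics may differ.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_error_rate(recognized: str, target: str) -> float:
    """Edit distance normalized by the target length."""
    if not target:
        raise ValueError("target text must be non-empty")
    return edit_distance(recognized, target) / len(target)


def temporal_flicker(frames: Sequence[Sequence[float]]) -> float:
    """Mean absolute change between consecutive frames.

    Each frame is a flat sequence of pixel values cropped to the text region.
    """
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        for a, b in zip(prev, curr):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


cer = char_error_rate("HELL0", "HELLO")                            # 0.2
flicker = temporal_flicker([[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])   # 0.5
```

Raw frame differences also pick up legitimate motion, so a fair comparison would compute the same flicker value on the unedited source clip and report the change after editing.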
Original abstract
Visual text editing aims to precisely modify text in images and videos while preserving stylistic consistency and visual realism. Despite significant advances in the image domain, video text editing remains largely unexplored: it is a localized task demanding stroke-level precision within small text regions, which compounds the challenges of cross-frame accuracy, temporal coherence, and stylistic fidelity. We introduce SteerVTE, a unified framework that steers a frozen video diffusion model to perform precise Video Text Editing through style and glyph control. Built on a frozen diffusion transformer, SteerVTE attaches a lightweight text context adapter with two complementary modules: a style encoder capturing the original text's visual attributes, and dual-granularity glyph encoders encoding the target text at both the line and character levels. To overcome the inherently weak text rendering priors of video foundation models, we further propose a glyph-aware spatial-focal loss and a three-stage progressive training curriculum that scales from image to video data. To support large-scale training, we also develop an automatic synthesis pipeline and construct SteerVTE-1M, a dataset of one million triplets spanning diverse scenes, fonts, and stylistic effects. Extensive experiments demonstrate that SteerVTE substantially outperforms existing video editing baselines across text accuracy, style consistency, and temporal coherence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SteerVTE, a unified framework for video text editing that steers a frozen video diffusion transformer via a lightweight text context adapter (style encoder plus dual-granularity glyph encoders at line and character levels), a glyph-aware spatial-focal loss, and a three-stage image-to-video curriculum. It also contributes an automatic synthesis pipeline and the SteerVTE-1M dataset of one million triplets, claiming substantial outperformance over video editing baselines on text accuracy, style consistency, and temporal coherence.
Significance. If the empirical claims hold with rigorous validation, the work would be significant for addressing an underexplored task (stroke-level text editing in video) without base-model retraining. The large-scale dataset and adapter-based control mechanism could enable practical applications in video post-production.
Major comments (2)
- [Abstract] The central claim of substantial outperformance across text accuracy, style consistency, and temporal coherence is stated without quantitative numbers, error bars, dataset splits, ablation details, or statistical tests. That evidence is load-bearing for assessing whether the adapters and loss actually compensate for the acknowledged weak text-rendering priors of the frozen base model.
- [Abstract] The assumption that lightweight adapters plus glyph-aware loss suffice for stroke-level precision in small text regions (without visible artifacts or coherence failures) is load-bearing for the no-retraining design; this requires explicit testing on challenging cases (tiny fonts, complex styles) that the abstract acknowledges as the core difficulty.
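One way to supply the error bars the first comment asks for is a paired bootstrap over per-clip scores. The sketch below assumes each clip yields one score for SteerVTE and one for a baseline; the function name `paired_bootstrap_ci`, its parameters, and the example scores are hypothetical and are not results from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(ours: List[float], baseline: List[float],
                        n_resamples: int = 2000, level: float = 0.95,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound).

    Resamples per-clip score differences with replacement and reads a
    percentile interval off the sorted resampled means.
    """
    if len(ours) != len(baseline) or not ours:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [o - b for o, b in zip(ours, baseline)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return mean_diff, means[lo_idx], means[hi_idx]


# Made-up per-clip scores, purely for illustration.
mean_diff, lower, upper = paired_bootstrap_ci(
    [2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0])
```

If the interval excludes zero, the improvement is unlikely to be a resampling artifact. Pairing by clip matters because both methods see the same inputs, so per-clip difficulty cancels out of the difference.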
Simulated Author's Rebuttal
We thank the referee for the detailed feedback. We address each major comment below and will incorporate revisions to strengthen the abstract and experimental validation as outlined.
Point-by-point responses
- Referee: [Abstract] The central claim of substantial outperformance across text accuracy, style consistency, and temporal coherence is stated without quantitative numbers, error bars, dataset splits, ablation details, or statistical tests. That evidence is load-bearing for assessing whether the adapters and loss actually compensate for the acknowledged weak text-rendering priors of the frozen base model.
  Authors: We agree that the abstract would benefit from including key quantitative results. In the revised manuscript, we will update the abstract to report specific metrics (e.g., text accuracy gains of X%, style consistency scores, and temporal coherence improvements from Tables 1-3) with references to the full experimental details, error bars, and dataset information already present in Sections 4 and 5. This will make the central claims more self-contained while preserving brevity.
  Revision: yes
- Referee: [Abstract] The assumption that lightweight adapters plus glyph-aware loss suffice for stroke-level precision in small text regions (without visible artifacts or coherence failures) is load-bearing for the no-retraining design; this requires explicit testing on challenging cases (tiny fonts, complex styles) that the abstract acknowledges as the core difficulty.
  Authors: The SteerVTE-1M dataset and our experiments already encompass diverse challenging cases including tiny fonts, complex styles, and small text regions, as described in the dataset construction and evaluation protocols. The dual-granularity glyph encoders and glyph-aware loss are specifically motivated to handle stroke-level precision. To directly address the concern, we will add a targeted analysis subsection with quantitative and qualitative results on these edge cases, confirming the absence of visible artifacts or coherence failures under the frozen-base design.
  Revision: yes
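The targeted edge-case analysis promised here could start from a simple stratification of text accuracy by rendered text height. The following sketch assumes each evaluation sample records a text height in pixels and whether the edited text was read back correctly. The bin edges, the sample values, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_label(height: int, edges: List[int]) -> str:
    """Map a text height in pixels to a size bucket such as '16-31px'."""
    lower = 0
    for edge in edges:
        if height < edge:
            return f"{lower}-{edge - 1}px"
        lower = edge
    return f">={lower}px"


def accuracy_by_text_height(samples: Iterable[Tuple[int, bool]],
                            edges: List[int]) -> Dict[str, float]:
    """Fraction of correctly edited samples within each height bucket."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for height, correct in samples:
        label = bucket_label(height, edges)
        totals[label] += 1
        hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up samples: (text height in pixels, edit read back correctly).
samples = [(10, False), (12, True), (20, True), (40, True), (50, True)]
report = accuracy_by_text_height(samples, edges=[16, 32])
# {'0-15px': 0.5, '16-31px': 1.0, '>=32px': 1.0}
```

Reporting the per-bucket sample counts alongside these fractions would show whether the tiny-font bucket is large enough to support a conclusion.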
Circularity Check
No circularity: empirical claims rest on experiments, not definitional reductions
Full rationale
The paper introduces SteerVTE as an empirical framework attaching lightweight adapters and a glyph-aware loss to a frozen diffusion model, with performance claims supported by experiments on a synthesized dataset. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content that would reduce any result to its inputs by construction. The central claims of outperformance are presented as experimental outcomes rather than derived identities, making the work self-contained against external benchmarks.