
arxiv: 2605.19320 · v1 · pith:NKMPEODNnew · submitted 2026-05-19 · 💻 cs.CV · cs.DB

TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

Pith reviewed 2026-05-20 06:55 UTC · model grok-4.3

classification 💻 cs.CV cs.DB
keywords text rendering · preference alignment · text-to-image generation · hierarchical reward · vision-language model · OCR accuracy · post-training optimization · DPO and GRPO

The pith

Text rendering in image generators improves by aligning preferences with a hierarchical VLM reward that judges errors at global, word, and glyph levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that text rendering can be improved in existing text-to-image models by framing it as a post-training preference alignment task instead of modifying the model architecture. It introduces a hierarchical vision-language model reward that breaks rendering defects into global layout, word, and individual glyph levels and turns those judgments into a scalar signal for optimization. This signal works with both Group Relative Policy Optimization and Direct Preference Optimization. Experiments on FLUX.1-dev and Z-Image-Turbo demonstrate higher OCR accuracy on rendered text while keeping overall image quality intact. A sympathetic reader cares because the method offers a way to fix a common failure mode without redesigning large foundation models.

Core claim

Text rendering is studied as a post-training preference-alignment problem. A hierarchical VLM-based reward decomposes rendering errors into global, word, and glyph levels, converts binary defect judgments into a scalar preference signal, and supports both GRPO and DPO. This produces consistent gains in OCR-based text accuracy on FLUX.1-dev and Z-Image-Turbo without degrading general generation quality, outperforming baselines such as SD3.5, Qwen-Image, AnyText, and TextDiffuser.

What carries the argument

A hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts the binary defect judgments into a scalar preference signal.
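The aggregation step can be sketched in Python. The paper's exact reward formula is not reproduced on this page, so the sketch below assumes a simple rule: the reward is the fraction of successfully parsed indicators that report no defect. The yes/no answer format and the names `parse_indicator` and `scalar_reward` are hypothetical, not taken from the paper.

```python
from typing import Dict, List, Optional

LEVELS = ("global", "word", "glyph")


def parse_indicator(vlm_answer: str) -> Optional[int]:
    """Map a VLM answer to 1 (defect), 0 (no defect), or None (unparsable).

    Assumption: the VLM is prompted to start its answer with "yes" or "no".
    """
    words = vlm_answer.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!")
    if first == "yes":
        return 1
    if first == "no":
        return 0
    return None


def scalar_reward(answers: Dict[str, List[str]]) -> Optional[float]:
    """Aggregate binary defect indicators from all levels into one scalar.

    Assumption (not the paper's formula): reward = share of parsed
    indicators that report no defect. Unparsable answers are skipped,
    and None is returned when nothing could be parsed.
    """
    parsed = [parse_indicator(a) for level in LEVELS for a in answers.get(level, [])]
    valid = [p for p in parsed if p is not None]
    if not valid:
        return None
    return 1.0 - sum(valid) / len(valid)


example = {"global": ["no"], "word": ["no", "yes"], "glyph": ["unclear"]}
reward = scalar_reward(example)  # three parsed indicators, one defect
```

Skipping unparsable answers instead of counting them as defects is a design choice in this sketch; a stricter variant could penalize them.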

If this is right

  • OCR accuracy on text in generated images rises consistently on the tested foundation models.
  • General image generation quality is preserved.
  • The same reward signal works for both GRPO and DPO optimization.
  • The approach compares favorably to existing text-rendering methods that require architecture changes.
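How one scalar reward can feed both optimizers is illustrated below. This is a minimal sketch under stated assumptions: GRPO-style advantages are computed by normalizing rewards within a group of samples drawn for the same prompt, and a DPO-style pair is formed from the highest- and lowest-reward samples. The pairing rule, the use of the population standard deviation, and the function names are illustrative choices, not details confirmed by the paper.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one prompt's sample group (GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def preference_pair(rewards: List[float]) -> Optional[Tuple[int, int]]:
    """Pick (chosen, rejected) sample indices for a DPO-style update.

    Assumption: best-vs-worst pairing. Returns None when all rewards tie,
    since a tied group carries no preference information.
    """
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    if rewards[best] == rewards[worst]:
        return None
    return best, worst


group = [1.0, 0.5, 0.0, 0.5]          # rewards for four images of one prompt
advantages = group_relative_advantages(group)
pair = preference_pair(group)          # indices of chosen and rejected images
```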

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hierarchical reward design could transfer to other fine-grained control tasks such as layout or style in generative models.
  • Multi-scale error decomposition might help in related domains like video generation where text elements must remain legible across frames.
  • Reward modeling focused on specific capabilities may reduce the need for full model retraining when scaling foundation systems.

Load-bearing premise

The hierarchical VLM-based reward accurately decomposes and judges rendering errors at global, word, and glyph levels to produce a reliable scalar preference signal that improves the generator.

What would settle it

Apply TextAlign to a text-to-image model not tested in the paper. The claim would be undermined if OCR accuracy on generated text did not improve, or if general image quality dropped measurably, relative to the unaligned base model.
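A check of this kind needs a text-accuracy score for each generated image. The sketch below computes a normalized edit distance (NED) between an OCR transcript and the reference text, which is one common choice. The paper's exact metric definitions are not reproduced on this page, so the normalization by the longer string and the helper names are assumptions; the OCR step itself is left out.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical.

    Assumption: lower is better. Some papers report 1 - NED instead.
    """
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))


def mean_ned(preds: List[str], refs: List[str]) -> float:
    """Average NED over paired OCR transcripts and reference texts."""
    if len(preds) != len(refs) or not refs:
        raise ValueError("need equally sized, non-empty lists")
    return sum(normalized_edit_distance(p, r) for p, r in zip(preds, refs)) / len(refs)


# Hypothetical OCR transcripts for the same prompts before and after alignment.
refs = ["OPEN", "SALE"]
base_score = mean_ned(["0PEN", "SAL"], refs)
aligned_score = mean_ned(["OPEN", "SALE"], refs)
```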

Figures

Figures reproduced from arXiv: 2605.19320 by Fajri Koto, Fengxian Ji, Jiaming Wang, Jingpu Yang, Mingxuan Cui, Qian Jiang, Xiuying Chen, Zhecheng Shi, Zirui Song.

Figure 1: Text rendering results. Representative 720 × 720 samples generated by our aligned models. TextAlign renders legible and well-formed visual text across diverse carriers, styles, layouts, and text lengths while preserving coherent image content.
… model can require non-trivial engineering and may disturb the pretrained generative prior that gives modern models their broad visual competence. We take a different…
Figure 2: Our hierarchical reward mechanism. Given a generated image x and reference text y, three independent VLM calls produce binary indicators at the global, word, and glyph levels, which are aggregated into a scalar reward R that drives either GRPO or DPO.
… model's qualitative judgement into parsable signals. Let Nv ≤ N denote the number of indicators successfully parsed for a given sample. We define the scalar r…
Figure 3: User study. Human preference votes on text fidelity and visual integration. Our GRPO-aligned models outperform prior baselines and base generators on both criteria, with Z-Image (Our GRPO) preferred most.
4.4 Evaluation on External Dataset. To test whether the gains from TextAlign transfer beyond our constructed benchmark, we further evaluate the same models on a 500-sample split of the external MARIO-Eval…
Figure 4: Qualitative comparison of text rendering results. Given the same prompts, GRPO-aligned FLUX and Z-Image produce more faithful and legible visual text while preserving the surrounding visual context.
… F1-score, although some ablated variants slightly improve a single metric such as NED or strict accuracy. Overall, the three reward levels are complementary: global feedback stabilizes readable text structure, …
Figure 5: Robustness to text length and spatial placement. Radar visualizations of FLUX (Our GRPO) and Z-Image-Turbo (Our GRPO) across text-length and position subsets.
Labels spilled from the figure image: Academic, Advertisement, Artistic, Basic, Cover, Handwriting, Logo, Poster, Scene, Sticker.
Figure 6: Qualitative results across visual categories. Z-Image (Our GRPO) renders legible text across diverse visual text scenarios while preserving category-specific style and layout.
Original abstract

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TextAlign, a non-invasive post-training framework for improving text rendering in text-to-image models via preference alignment. It employs a hierarchical VLM-based reward that decomposes rendering errors at global, word, and glyph levels, converting binary defect judgments into scalar preference signals usable with GRPO or DPO. Experiments on FLUX.1-dev and Z-Image-Turbo claim consistent OCR accuracy gains without degrading general generation quality, outperforming baselines such as SD3.5, Qwen-Image, AnyText, and TextDiffuser.

Significance. If the central empirical claims hold, the work demonstrates that reward design can serve as a scalable alternative to architecture-specific modifications for enhancing fine-grained text rendering in foundation models. This approach preserves model compatibility and could generalize across generators. The hierarchical decomposition targets the multi-scale nature of text errors, which is a strength if the VLM judgments prove reliable and stable.

major comments (2)
  1. [§3 (Method)] The central claim rests on the hierarchical VLM reward producing an accurate and stable preference signal. No validation of the VLM's binary defect judgments (e.g., inter-rater agreement with humans or error analysis at the glyph level) is provided, despite known limitations of VLMs on fine-grained visual reasoning; this directly affects whether the reported OCR gains on FLUX.1-dev and Z-Image-Turbo can be attributed to the method.
  2. [§4 (Experiments)] Experiments assert consistent OCR-based text accuracy gains without degrading quality and superiority to baselines, yet the available description provides no quantitative metrics, error bars, dataset sizes, ablation results, or statistical tests. This leaves the magnitude, reliability, and specificity of the improvements unverified.
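The agreement check requested in the first major comment could be run with Cohen's kappa between the VLM's binary defect labels and human labels on the same samples. The sketch below is a plain-Python implementation of the standard kappa formula; the label lists are made-up illustrative data, not results from the paper.

```python
from typing import List, Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need equally sized, non-empty label lists")
    a, b = list(rater_a), list(rater_b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical glyph-level labels: 1 = defect, 0 = no defect.
vlm_labels: List[int] = [1, 1, 0, 0, 1, 0, 1, 0]
human_labels: List[int] = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(vlm_labels, human_labels)
```

Computing kappa separately for the global, word, and glyph levels would show whether reliability drops at finer granularity, which is the specific concern raised above.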
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., OCR accuracy delta) to support the empirical claims.
  2. [§3.2] Notation for the scalar preference signal derivation from binary judgments could be clarified with an explicit equation or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions that will be incorporated to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3 (Method)] The central claim rests on the hierarchical VLM reward producing an accurate and stable preference signal. No validation of the VLM's binary defect judgments (e.g., inter-rater agreement with humans or error analysis at the glyph level) is provided, despite known limitations of VLMs on fine-grained visual reasoning; this directly affects whether the reported OCR gains on FLUX.1-dev and Z-Image-Turbo can be attributed to the method.

    Authors: We agree that explicit validation of the VLM judgments would strengthen attribution of the OCR gains to the hierarchical reward. The current manuscript supports the reward's utility through consistent downstream OCR improvements and outperformance versus baselines, but we acknowledge this is indirect evidence. We will add a dedicated validation subsection with human inter-rater agreement (Cohen's kappa) on a sampled set of global/word/glyph judgments and a glyph-level error breakdown comparing VLM decisions to human annotations. revision: yes

  2. Referee: [§4 (Experiments)] Experiments assert consistent OCR-based text accuracy gains without degrading quality and superiority to baselines, yet the available description provides no quantitative metrics, error bars, dataset sizes, ablation results, or statistical tests. This leaves the magnitude, reliability, and specificity of the improvements unverified.

    Authors: We appreciate the request for greater experimental transparency. The manuscript contains tables reporting OCR accuracy on FLUX.1-dev and Z-Image-Turbo together with baseline comparisons, but we agree the presentation can be expanded. In revision we will include per-experiment dataset sizes, standard deviations or error bars on OCR metrics, ablation results isolating each level of the hierarchical reward, and paired statistical tests (e.g., Wilcoxon signed-rank) to establish significance of the reported gains. revision: yes
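The paired test mentioned in the second response can be sketched as follows. This is a small exact Wilcoxon signed-rank test written in plain Python, assuming per-prompt OCR scores for the base and aligned model on the same prompts. It enumerates all sign patterns, so it is only practical for small samples; the scores shown are invented for illustration and the function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def signed_ranks(diffs: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Drop zero differences and rank the rest by absolute value (ties averaged)."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return nonzero, ranks


def wilcoxon_exact(base: Sequence[float], aligned: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, exact two-sided p-value) for paired scores."""
    if len(base) != len(aligned):
        raise ValueError("scores must be paired")
    nonzero, ranks = signed_ranks([a - b for a, b in zip(aligned, base)])
    n = len(ranks)
    if n == 0:
        return 0.0, 1.0
    if n > 20:
        raise ValueError("exact enumeration is limited to 20 non-zero pairs")
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for s, r in zip(signs, ranks) if s)
        if abs(w - center) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** n


# Hypothetical per-prompt OCR accuracy in percent.
base_scores = [50, 55, 60, 40, 45, 52]
aligned_scores = [60, 75, 90, 45, 60, 77]
w_plus, p_value = wilcoxon_exact(base_scores, aligned_scores)
```

For benchmark-sized samples a library routine with a normal approximation, such as `scipy.stats.wilcoxon`, would be the more practical choice.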

Circularity Check

0 steps flagged

No circularity: reward signal derived from external VLM judgments


The paper describes TextAlign as a post-training preference alignment method that applies a hierarchical VLM-based reward to decompose text rendering errors at global, word, and glyph levels, converting binary defect judgments into a scalar preference signal for GRPO and DPO. No equations, derivations, or self-referential definitions appear in the abstract or described framework that reduce the claimed OCR accuracy gains to fitted parameters or tautological inputs by construction. The reward originates from external VLM evaluations rather than internal fits or self-citations that bear the central load, and experiments on FLUX.1-dev and Z-Image-Turbo are presented as empirical validation against baselines. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a VLM can reliably produce multi-level binary defect judgments convertible to scalar preferences. No free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: a vision-language model can accurately judge text rendering defects at global, word, and glyph levels and convert these into a useful preference signal.
    This underpins the entire reward model and subsequent alignment training described in the abstract.

pith-pipeline@v0.9.0 · 5742 in / 1216 out tokens · 27511 ms · 2026-05-20T06:55:41.783937+00:00 · methodology


