TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

Fajri Koto; Fengxian Ji; Jiaming Wang; Jingpu Yang; Mingxuan Cui; Qian Jiang; Xiuying Chen; Zhecheng Shi; Zirui Song

A hierarchical VLM reward converts binary defect judgments into scalar signals that align text-to-image models for accurate rendering via GRPO or DPO.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 18:48 UTC pith:NKMPEODN

Load-bearing objection: TextAlign shows a post-training hierarchical VLM reward can lift text rendering on FLUX and Z-Image-Turbo, but the conversion from VLM binary labels to scalar preference lacks any reported human validation.

arxiv 2605.19320 v2 pith:NKMPEODN submitted 2026-05-19 cs.CV cs.DB

Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi, Jiaming Wang, Zirui Song, Fajri Koto, Xiuying Chen

classification cs.CV cs.DB

keywords text rendering, preference alignment, hierarchical reward, text-to-image, VLM, GRPO, DPO, OCR accuracy

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames text rendering as a post-training preference-alignment task instead of an architecture redesign problem. It builds TextAlign around a vision-language model that checks errors at global, word, and glyph levels and turns those checks into a single preference score. The score then drives Group Relative Policy Optimization or Direct Preference Optimization on the unchanged generator. Tests on FLUX.1-dev and Z-Image-Turbo produce higher OCR accuracy while preserving general image quality. The results position reward design as a model-agnostic way to fix text rendering across foundation models.

Core claim

TextAlign keeps the generator architecture fixed and instead supplies a hierarchical VLM-based reward that decomposes text-rendering quality into global, word, and glyph levels, converts the binary defect judgments into a scalar preference signal, and feeds that signal to either GRPO or DPO optimization.

What carries the argument

Hierarchical VLM-based reward that decomposes rendering errors into global, word, and glyph levels then converts the binary judgments into a scalar preference signal for GRPO or DPO.
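The aggregation step can be illustrated with a minimal sketch. It assumes, since this summary does not state the paper's actual rule, that each level yields a binary defect flag or fails to parse, and that the scalar reward is the share of successfully parsed levels with no defect. The function name `aggregate_reward`, the level names, and the `fallback` value are all hypothetical.

```python
from typing import Mapping, Optional

LEVELS = ("global", "word", "glyph")  # hypothetical level names


def aggregate_reward(
    defects: Mapping[str, Optional[bool]],
    fallback: float = 0.0,
) -> float:
    """Map per-level binary defect flags to a scalar in [0, 1].

    defects[level] is True if the VLM reported a defect, False if it
    reported none, and None if its answer could not be parsed.
    Assumed rule (not the paper's): share of parsed levels without a defect.
    """
    flags = [defects.get(level) for level in LEVELS]
    parsed = [flag for flag in flags if flag is not None]
    if not parsed:  # no usable judgment for this sample
        return fallback
    return sum(1 for flag in parsed if not flag) / len(parsed)


example = {"global": False, "word": False, "glyph": True}
print(aggregate_reward(example))
```

Other rules (a weighted sum, or requiring all three levels to pass) would slot into the same signature, which is the kind of comparison an aggregation ablation would need.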

Load-bearing premise

The VLM's binary defect judgments at the three levels can be turned into a reliable scalar preference signal that supports effective optimization.

What would settle it

Training FLUX.1-dev with the hierarchical reward and then measuring OCR accuracy on a fixed prompt set yields no gain over the unaligned baseline, or general image quality drops.
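A falsification run of this kind needs a fixed scoring rule for OCR output. The sketch below shows one common choice, an exact-match flag plus a normalized edit similarity; the paper's exact metric definitions are not given in this summary, so the function names and the lowercase/strip normalization are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def text_scores(ocr_text: str, reference: str) -> dict:
    """Exact-match flag and normalized edit similarity in [0, 1]."""
    # Assumed normalization: trim whitespace and ignore case.
    pred, ref = ocr_text.strip().lower(), reference.strip().lower()
    longest = max(len(pred), len(ref))
    similarity = 1.0 if longest == 0 else 1.0 - edit_distance(pred, ref) / longest
    return {"exact": pred == ref, "similarity": similarity}


print(text_scores("OPEM", "open"))
```

Averaging these scores over the same prompt set for the baseline and the aligned model gives the two numbers the falsifier compares.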

If this is right

OCR text accuracy rises on both FLUX.1-dev and Z-Image-Turbo.
General generation quality stays comparable to the original models.
The same reward works with both GRPO and DPO.
The method outperforms several strong baselines including SD3.5, Qwen-Image, AnyText, and TextDiffuser.
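The claim that one reward serves both optimizers can be sketched as follows. The sketch assumes the usual GRPO convention of standardizing rewards within the group of samples drawn for one prompt, and a simple best-versus-worst rule for building DPO pairs; neither the use of population standard deviation nor the pairing rule is confirmed by this summary, and both function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize rewards within one prompt's sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def best_worst_pair(rewards: List[float], min_gap: float = 0.0) -> Optional[Tuple[int, int]]:
    """DPO-style pair: indices of (preferred, rejected) samples, or None on a tie."""
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    if rewards[best] - rewards[worst] <= min_gap:
        return None  # reward gap too small to call a preference
    return best, worst


# Placeholder scalar rewards for four samples of one prompt.
group = [1.0, 2 / 3, 1 / 3, 2 / 3]
print(group_relative_advantages(group))
print(best_worst_pair(group))
```

With only three binary levels the reward takes few distinct values, so ties inside a group are likely; the `min_gap` check shows where such groups would drop out of DPO pair construction.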

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-level judgment structure could be reused for other fine-grained visual control tasks.
If the reward proves robust, specialized text encoders may become unnecessary for many applications.
Extending the hierarchy to video or layout-constrained generation is a direct next test.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

TextAlign shows a post-training hierarchical VLM reward can lift text rendering on FLUX and Z-Image-Turbo, but the conversion from VLM binary labels to scalar preference lacks any reported human validation.

The main point is that this paper treats text rendering as a preference-alignment task rather than an architecture problem. It introduces a hierarchical VLM reward that scores at global, word, and glyph levels, turns the binary defect calls into a scalar signal, and runs GRPO or DPO on FLUX.1-dev and Z-Image-Turbo. The reported outcome is higher OCR accuracy with no drop in general image quality, and it beats several text-specific baselines.

What the work does cleanly is keep the generator unchanged. That makes the method portable across foundation models, which is the practical advantage over adding encoder modules or special text layers. The three-level decomposition is a sensible way to handle the different granularities of text errors.

The soft spot is exactly where the stress-test flagged it. The whole pipeline rests on the VLM producing reliable binary judgments that can be aggregated into a preference signal. The abstract gives no numbers on how those VLM outputs align with human glyph-level labels, no inter-annotator agreement, and no ablation of the aggregation rule. If the VLM is systematically off on fine text details, the preference pairs fed to GRPO and DPO will be noisy, and the claimed gains become harder to trust. The experiments compare against SD3.5, Qwen-Image, AnyText, and TextDiffuser, but without that validation step the central claim that reward design alone is enough stays under-supported.

This is for people who work on post-training fixes for text-to-image models and want something that does not require retraining the base weights. A reader who cares about alignment techniques or practical deployment would find the reward decomposition worth looking at. The paper is coherent on its own terms and engages the literature, so it deserves a serious referee even though the VLM validation needs to be addressed.

Referee Report

3 major / 2 minor

Summary. The paper proposes TextAlign, a non-invasive post-training framework for improving text rendering in text-to-image models. It introduces a hierarchical VLM-based reward that decomposes errors into global, word, and glyph levels, converts binary defect judgments into scalar preferences, and applies these to GRPO and DPO optimization on models such as FLUX.1-dev and Z-Image-Turbo. Experiments claim consistent OCR accuracy gains without degrading general generation quality, outperforming baselines including SD3.5, Qwen-Image, AnyText, and TextDiffuser, positioning reward design as a scalable alternative to architectural changes.

Significance. If the results hold after addressing validation gaps, the work would be significant as it provides empirical evidence that preference alignment via carefully designed hierarchical rewards can improve a persistent weakness in foundation models without modifying their architecture. This could enable broader deployment across existing generators and contribute to the growing literature on reward modeling for generative tasks.

major comments (3)

[Abstract] Abstract and Experiments section: The central claim of consistent gains in OCR-based text accuracy relies on the hierarchical VLM reward producing reliable preference signals, yet no quantitative validation (e.g., precision/recall of VLM glyph-level judgments vs. human labels or inter-annotator agreement) is reported. This directly affects whether the scalar preference signal supports effective GRPO/DPO as asserted.
[Reward Model] Reward construction (hierarchical VLM section): The conversion rule from three-level binary defect judgments to scalar preference is load-bearing for the optimization claims, but the manuscript supplies no ablation of the aggregation function or evidence that VLM judgments avoid systematic biases on fine-grained text, which is a known issue with current VLMs.
[Experiments] Experiments and baselines: Comparisons to AnyText and TextDiffuser require explicit details on implementation, prompt sets, and statistical significance testing of the reported OCR gains; without these, it is unclear whether the improvements are attributable to the proposed reward or to differences in evaluation protocol.
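The validation requested in the first major comment could be computed along these lines. This is a sketch under stated assumptions: the labels are binary defect flags for the same samples from the VLM and from a human annotator, and `agreement_stats` is a hypothetical helper, not something the paper provides.

```python
from typing import Sequence


def agreement_stats(vlm: Sequence[bool], human: Sequence[bool]) -> dict:
    """Precision, recall, and Cohen's kappa for binary defect labels."""
    if len(vlm) != len(human) or not vlm:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(vlm)
    tp = sum(1 for v, h in zip(vlm, human) if v and h)
    fp = sum(1 for v, h in zip(vlm, human) if v and not h)
    fn = sum(1 for v, h in zip(vlm, human) if h and not v)
    tn = n - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    observed = (tp + tn) / n
    # Chance agreement from each rater's marginal defect rate.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {"precision": precision, "recall": recall, "kappa": kappa}


# Placeholder labels, not data from the paper.
print(agreement_stats([True, True, False, False], [True, False, False, False]))
```

The same function applied to two human annotators gives an inter-annotator kappa, which sets the ceiling against which the VLM's agreement should be read.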

minor comments (2)

[Methods] Clarify the exact mathematical form of the scalar reward aggregation from the three binary levels in the methods section.
[Results] Add error bars or multiple-run statistics to the OCR accuracy tables to support the 'consistent gains' claim.
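The error bars asked for in the second minor comment amount to a small summary over repeated runs. A minimal sketch, assuming each run yields one OCR accuracy value; `summarize_runs` is a hypothetical helper and the numbers are placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence


def summarize_runs(scores: Sequence[float]) -> dict:
    """Mean, sample standard deviation, and standard error across runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    spread = stdev(scores)
    return {
        "mean": mean(scores),
        "std": spread,
        "sem": spread / len(scores) ** 0.5,
    }


# Placeholder accuracies from three hypothetical seeds.
print(summarize_runs([0.71, 0.73, 0.72]))
```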

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger validation of the reward model, ablations on aggregation, and clearer experimental protocols. We address each major comment below and will incorporate revisions to improve the manuscript.

Referee: [Abstract] Abstract and Experiments section: The central claim of consistent gains in OCR-based text accuracy relies on the hierarchical VLM reward producing reliable preference signals, yet no quantitative validation (e.g., precision/recall of VLM glyph-level judgments vs. human labels or inter-annotator agreement) is reported. This directly affects whether the scalar preference signal supports effective GRPO/DPO as asserted.

Authors: We agree that explicit quantitative validation of the VLM judgments is a gap in the current manuscript. While end-to-end OCR gains provide supporting evidence, we will add a new validation subsection in the revised Experiments section. This will include a human study on a sampled set of generations, reporting precision/recall of VLM glyph-level judgments against human labels and inter-annotator agreement statistics. revision: yes
Referee: [Reward Model] Reward construction (hierarchical VLM section): The conversion rule from three-level binary defect judgments to scalar preference is load-bearing for the optimization claims, but the manuscript supplies no ablation of the aggregation function or evidence that VLM judgments avoid systematic biases on fine-grained text, which is a known issue with current VLMs.

Authors: We will add an ablation study in the revised Reward Model section comparing alternative aggregation functions for converting the three-level binary judgments into scalar preferences. We will also include qualitative and quantitative analysis of potential VLM biases on fine-grained text, with examples contrasting VLM outputs against human judgments to address known limitations. revision: yes
Referee: [Experiments] Experiments and baselines: Comparisons to AnyText and TextDiffuser require explicit details on implementation, prompt sets, and statistical significance testing of the reported OCR gains; without these, it is unclear whether the improvements are attributable to the proposed reward or to differences in evaluation protocol.

Authors: We will expand the Experiments section with full implementation details for AnyText and TextDiffuser (including code references or hyperparameters), the exact prompt sets and evaluation protocols, and statistical significance testing (e.g., paired t-tests) on the OCR accuracy improvements to confirm the gains are robust and not due to protocol differences. revision: yes
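The paired test the authors mention can be sketched as below, assuming both models are scored on the same prompts so that each prompt contributes one difference. `paired_t_statistic` is a hypothetical helper; it returns only the statistic and degrees of freedom, and the p-value would come from a t distribution, for example via a library routine such as SciPy's `ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    baseline: Sequence[float], aligned: Sequence[float]
) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for per-prompt paired score differences."""
    if len(baseline) != len(aligned) or len(baseline) < 2:
        raise ValueError("need two equal-length score lists with at least two prompts")
    diffs = [a - b for a, b in zip(aligned, baseline)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (spread / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-prompt scores, not data from the paper.
print(paired_t_statistic([0.2, 0.4, 0.5, 0.6], [0.5, 0.6, 0.6, 0.9]))
```

Pairing by prompt matters here: it removes prompt difficulty from the comparison, which is exactly the protocol difference the referee worries could explain the gains.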

Circularity Check

0 steps flagged

No significant circularity

The paper's core proposal is a hierarchical VLM-based reward that decomposes rendering errors into global/word/glyph levels and converts binary judgments into a scalar preference signal for GRPO/DPO. This relies on external VLM outputs and empirical validation on FLUX.1-dev and Z-Image-Turbo against listed baselines; no equations, self-citations, or fitted parameters are shown that reduce the claimed gains to the inputs by construction. The derivation remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper's contribution centers on the design of the reward function rather than introducing new free parameters or axioms; details limited to abstract.

invented entities (1)

hierarchical VLM-based reward model · no independent evidence
purpose: Decomposes rendering errors into global, word, and glyph levels and converts binary judgments into a scalar preference signal.
Core innovation described in the abstract; no external validation or independent evidence provided.

pith-pipeline@v0.9.1-grok · 5742 in / 1127 out tokens · 39720 ms · 2026-06-30T18:48:49.655376+00:00 · methodology

Original abstract

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.

Figures

Figures reproduced from arXiv: 2605.19320 by Fajri Koto, Fengxian Ji, Jiaming Wang, Jingpu Yang, Mingxuan Cui, Qian Jiang, Xiuying Chen, Zhecheng Shi, Zirui Song.

**Figure 1.** Text rendering results. Representative 720 × 720 samples generated by our aligned models. TextAlign renders legible and well-formed visual text across diverse carriers, styles, layouts, and text lengths while preserving coherent image content.

… model can require non-trivial engineering and may disturb the pretrained generative prior that gives modern models their broad visual competence. We take a different…

**Figure 2.** Our hierarchical reward mechanism. Given a generated image x and reference text y, three independent VLM calls produce binary indicators at the global, word, and glyph levels, which are aggregated into a scalar reward R that drives either GRPO or DPO.

… model's qualitative judgement into parsable signals. Let N_v ≤ N denote the number of indicators successfully parsed for a given sample. We define the scalar r…

**Figure 3.** User study. Human preference votes on text fidelity and visual integration. Our GRPO-aligned models outperform prior baselines and base generators on both criteria, with Z-Image (Our GRPO) preferred most.

4.4 Evaluation on External Dataset. To test whether the gains from TextAlign transfer beyond our constructed benchmark, we further evaluate the same models on a 500-sample split of the external MARIO-Eval…

**Figure 4.** Qualitative comparison of text rendering results. Given the same prompts, GRPO-aligned FLUX and Z-Image produce more faithful and legible visual text while preserving the surrounding visual context.

… F1-score, although some ablated variants slightly improve a single metric such as NED or strict accuracy. Overall, the three reward levels are complementary: global feedback stabilizes readable text structure, …

**Figure 5.** Robustness to text length and spatial placement. Radar visualizations of FLUX (Our GRPO) and Z-Image-Turbo (Our GRPO) across text-length and position subsets.

Category labels: Academic, Advertisement, Artistic, Basic, Cover, Handwriting, Logo, Poster, Scene, Sticker.

**Figure 6.** Qualitative results across visual categories. Z-Image (Our GRPO) renders legible text across diverse visual text scenarios while preserving category-specific style and layout.

Review history (2 revisions)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Parametric Memory Decoding for Zero-Shot Routing in LoRA-Based External Parametric Memory
cs.LG · 2026-07 · conditional · novelty 6.0

PMDRouter selects LoRAs zero-shot by decoding scale-normalized linear response energy from one adapter-free backbone prefill, and leads most internal-signal baselines on a new multi-granularity EPM bench.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1] [1]

Black Forest Labs. FLUX.1. https://blackforestlabs.ai/, 2024. Text-to-image model suite and release documentation

work page 2024

[2] [2]

J. Chen, Y . Huang, T. Lv, L. Cui, Q. Chen, and F. Wei. Textdiffuser: Diffusion models as text painters. In Advances in Neural Information Processing Systems, 2023

work page 2023

[3] [3]

J. Chen, Y . Huang, T. Lv, L. Cui, Q. Chen, and F. Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. InEuropean Conference on Computer Vision, pages 386–402. Springer, 2024

work page 2024

[4] [4]

Q. Chen, Y . Ma, H. Wang, J. Yuan, W. Zhao, Q. Tian, H. Wang, S. Min, Q. Chen, and W. Liu. Infinite- canvas: Higher-resolution video outpainting with extensive content generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 2150–2158, 2025

work page 2025

[5] [5]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y . Marek, and R. Rombach. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] L. Gao, J.-Y. He, Y. Zeng, Y. Zhong, X. Sun, J. Hu, Z. Gao, and X. Wei. Vitype: High-fidelity visual text rendering via glyph-aware multimodal diffusion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4131–4139, 2026.

[7] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y. Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.

[8] X. Hu, K. Xu, B. Liu, Q. Liu, and H. Fei. Amo sampler: Enhancing text rendering with overshooting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13157–13166, 2025.

[9] F. Ji, J. Yang, Z. Song, L. Gao, J. Liang, Z. Chen, J. Zhang, and X. Chen. Servimage: An image generation and editing benchmark from real-world commercial imaging services, 2026.

[10] F. Ji, J. Yang, Z. Song, Y. Wang, Z. Cui, Y. Li, Q. Jiang, and X. Chen. Finestate-bench: Benchmarking state-conditioned grounding for fine-grained GUI state setting, 2026.

[11] F. Ji, J. Yang, Z. Song, Y. Wang, Z. Cui, Y. Li, Q. Jiang, M. Fang, and X. Chen. Finestate-bench: A comprehensive benchmark for fine-grained state control in GUI agents, 2025.

[12] Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023.

[13] W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[14] Z. Lai, Y. Zheng, Z. Cai, H. Lyu, J. Yang, H. Liang, Y. Hu, and B. Wang. Can multimodal LLMs see materials clearly? A multimodal benchmark on materials characterization. arXiv preprint arXiv:2509.09307, 2025.

[15] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023.

[16] R. Liu, D. Garrette, C. Saharia, W. Chan, A. Roberts, S. Narang, I. Blok, R. Mical, M. Norouzi, and N. Constant. Character-aware models improve visual text rendering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16270–16297, 2023.

[17] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations, 2023.

[18] Z. Liu, W. Liang, Z. Liang, C. Luo, J. Li, G. Huang, and Y. Yuan. Glyph-byt5: A customized text encoder for accurate visual text rendering. In European Conference on Computer Vision, pages 361–377. Springer, 2024.

[19] J. Ma, M. Zhao, C. Chen, R. Wang, D. Niu, H. Lu, and X. Lin. Glyphdraw: Seamlessly rendering text with intricate spatial structures in text-to-image generation. arXiv preprint arXiv:2303.17870, 2023.

[20] Y. Ma, H. Liu, H. Wang, H. Pan, Y. He, J. Yuan, A. Zeng, C. Cai, H.-Y. Shum, W. Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024.

[21] Y. Ma, X. Wu, K. Chen, F. Zhu, R. Zhao, and H. Li. HPSv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789, 2025.

[22] X. Meng, S. Huang, J. Yang, M. Ma, Z. Ma, L. Han, G. Yuan, H. Li, and L. Cheng. From reach to insert: Tactile-augmented precision assembly under sub-millimeter tolerances, 2026.

[23] PaddlePaddle Team. PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025.

[24] W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

[25] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, 2023.

[26] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

[27] C. Schuhmann. LAION-Aesthetics: A linear aesthetic quality predictor on top of CLIP embeddings. https://github.com/christophschuhmann/improved-aesthetic-predictor, 2022.

[28] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[29] W. Shi, Y. Song, D. Zhang, J. Liu, and X. Zou. Fonts: Text rendering with typography and style controls. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18463–18474, 2025.

[30] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

[31] Z. Song, J. Yang, Y. Huang, J. Tonglet, Z. Zhang, T. Cheng, M. Fang, I. Gurevych, and X. Chen. Geolocation with real human gameplay data: A large-scale dataset and human-like reasoning framework, 2026.

[32] Ł. Staniszewski, B. Cywiński, F. Boenisch, K. Deja, and A. Dziedzic. Precise parameter localization for textual generation in diffusion models. In The Thirteenth International Conference on Learning Representations, 2025.

[33] D. Tang, Q. Jiang, J. Yang, J. Zhao, X. Du, M. Fang, and X. Zhang. Sltp: A symbolic travel-planning agent framework with decoupled translation and heuristic tree search. Electronics, 15(2), 2026.

[34] Y. Tuo, W. Xiang, J.-Y. He, Y. Geng, and X. Xie. Anytext: Multilingual visual text generation and editing. In International Conference on Learning Representations, 2024.

[35] B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[36] Y. Wang, C. Han, Y. Li, Z. Jin, X. Li, S. Du, W. Tao, S. Li, Y. Yang, C. Yuan, et al. Uniglyph: Unified segmentation-conditioned diffusion for precise visual text synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18335–18344, 2025.

[37] Y. Wang, W. Zhang, H. Xu, and C. Jin. Dreamtext: High fidelity scene text synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28555–28563, 2025.

[38] Z. Wang, J. Bao, S. Gu, D. Chen, W. Zhou, and H. Li. Designdiffusion: High-quality text-to-design image generation with diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 20906–20915, 2025.

[39] C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S.-m. Yin, S. Bai, X. Xu, Y. Chen, Y. Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y. Wang, Y. Zhang, Y. Zhu, Y. Wu, Y. Cai, and Z. Liu. Qwen-image technical r...

[40] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.

[41] Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang, and P. Luo. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025.

[42] Z. Yan, J. Wang, A. Wang, Y. Li, W. Shang, and Z. Hangcheng. Textmaster: A unified framework for realistic text editing via glyph-style dual-control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16112–16121, 2025.

[43] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[44] J. Yang, M. Cui, H. Zhang, F. Ji, Z. Lai, and Y. Wang. Agent-based anti-jamming techniques for UAV communications in adversarial environments: A comprehensive survey, 2025.

[45] J. Yang, Z. Han, M. Xiang, H. Wang, Y. Huang, and M. Fang. Asynchronous and segmented bidirectional encoding for NMT. CoRR, abs/2402.14849, 2024.

[46] J. Yang, H. Wang, Q. Zhao, Z. Shi, Z. Song, and M. Fang. Efficient reinforcement learning via decoupling exploration and utilization. In International Conference on Intelligent Computing, pages 396–406. Springer, 2024.

[47] J. Yang, H. Zhang, F. Ji, Y. Wang, M. Wang, Y. Luo, and W. Ding. Frequency point game environment for UAVs via expert knowledge and large language model. Drones, 10(2), 2026.

[48] Y. Yang, D. Gui, Y. Yuan, W. Liang, H. Ding, H. Hu, and K. Chen. Glyphcontrol: Glyph conditional control for visual text generation. Advances in Neural Information Processing Systems, 36:44050–44066, 2023.

[49] J. Yuan, X. Zhang, H. Zhou, J. Wang, Z. Qiu, Z. Shao, S. Zhang, S. Long, K. Kuang, K. Yao, et al. Hap: Structure-aware masked image modeling for human-centric perception. Advances in Neural Information Processing Systems, 36:50597–50616, 2023.

[50] Z-Image Team, H. Cai, S. Cao, R. Du, P. Gao, S. Hoi, Z. Hou, S. Huang, D. Jiang, X. Jin, L. Li, Z. Li, Z.-Y. Li, D. Liu, D. Liu, J. Shi, Q. Wu, F. Yu, C. Zhang, S. Zhang, and S. Zhou. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025.

[51] B. Zhang, Z. Gao, Y. Qu, and H. Xie. How control information influences multilingual text image generation and editing? Advances in Neural Information Processing Systems, 37:6884–6904, 2024.

[52] L. Zhang, X. Chen, Y. Wang, Y. Lu, and Y. Qiao. Brush your text: Synthesize any scene text on images via diffusion model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7215–7223, 2024.

[53] Y. Zhao and Z. Lian. Udifftext: A unified framework for high-quality text synthesis in arbitrary images via character-aware diffusion models. In European Conference on Computer Vision, pages 217–233. Springer, 2024.

[54] Y. Zhu, J. Liu, F. Gao, W. Liu, X. Wang, P. Wang, F. Huang, C. Yao, and Z. Yang. Visual text generation in the wild. In European Conference on Computer Vision, pages 89–106. Springer, 2024.

A Details of the Text Rendering Benchmark Construction

This section provides the full construction details summarized in Sec. 4.1. The pipeline is run independently pe...