TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards
Pith reviewed 2026-05-20 06:55 UTC · model grok-4.3
The pith
Text rendering in image generators improves by aligning preferences with a hierarchical VLM reward that judges errors at global, word, and glyph levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Text rendering is studied as a post-training preference-alignment problem. A hierarchical VLM-based reward decomposes rendering errors into global, word, and glyph levels, converts binary defect judgments into a scalar preference signal, and supports both GRPO and DPO. This produces consistent gains in OCR-based text accuracy on FLUX.1-dev and Z-Image-Turbo without degrading general generation quality, outperforming baselines such as SD3.5, Qwen-Image, AnyText, and TextDiffuser.
What carries the argument
A hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal.
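The source does not give a formula for turning binary defect judgments into a scalar. The sketch below shows one way such a signal could be assembled, purely for illustration: the `Judgment` structure, the `LEVEL_WEIGHTS` values, and the weighted-mean aggregation are assumptions, not the paper's definition.

```python
from dataclasses import dataclass

# Hypothetical per-level weights; the paper's actual weighting is not stated in the source.
LEVEL_WEIGHTS = {"global": 0.2, "word": 0.4, "glyph": 0.4}


@dataclass(frozen=True)
class Judgment:
    level: str    # "global", "word", or "glyph"
    defect: bool  # True if the VLM flagged a defect for this check


def scalar_reward(judgments):
    """Weighted fraction of defect-free checks, in [0, 1] (assumed aggregation)."""
    per_level = {}
    for j in judgments:
        passed, total = per_level.get(j.level, (0, 0))
        per_level[j.level] = (passed + (0 if j.defect else 1), total + 1)

    score = 0.0
    weight_sum = 0.0
    for level, (passed, total) in per_level.items():
        w = LEVEL_WEIGHTS[level]
        score += w * passed / total  # pass rate at this level
        weight_sum += w
    # Renormalize so that levels with no checks do not drag the reward down.
    return score / weight_sum if weight_sum else 0.0
```

Under these assumed weights, an image with a clean global check, one of two word checks passing, and a failed glyph check would score about 0.4. Whether the paper uses a weighted mean, a gated product, or something else is not recoverable from the abstract.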
If this is right
- OCR accuracy on text in generated images rises consistently on the tested foundation models.
- General image generation quality is preserved.
- The same reward signal works for both GRPO and DPO optimization.
- The approach compares favorably to existing text-rendering methods that require architecture changes.
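To make concrete how one scalar reward can feed both optimizers, here is a minimal sketch. It assumes the rewards for several samples of the same prompt are already computed. The function names, the use of the population standard deviation, and the best-versus-worst pairing rule are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption here
    return [(r - mu) / (sigma + eps) for r in rewards]


def best_worst_pair(rewards, min_gap=0.0):
    """Pick (chosen, rejected) sample indices for a DPO pair.

    Returns None when the reward gap does not exceed min_gap, so that
    near-ties are not turned into preference labels.
    """
    indices = range(len(rewards))
    chosen = max(indices, key=rewards.__getitem__)
    rejected = min(indices, key=rewards.__getitem__)
    if rewards[chosen] - rewards[rejected] <= min_gap:
        return None
    return chosen, rejected
```

For a group with rewards `[0.2, 0.8, 0.5]`, the second sample gets the largest advantage and `best_worst_pair` returns `(1, 0)`. GRPO consumes the per-sample advantages directly, while DPO only needs the ordered pair.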
Where Pith is reading between the lines
- The hierarchical reward design could transfer to other fine-grained control tasks such as layout or style in generative models.
- Multi-scale error decomposition might help in related domains like video generation where text elements must remain legible across frames.
- Reward modeling focused on specific capabilities may reduce the need for full model retraining when scaling foundation systems.
Load-bearing premise
The hierarchical VLM-based reward accurately decomposes and judges rendering errors at global, word, and glyph levels to produce a reliable scalar preference signal that improves the generator.
What would settle it
Running the TextAlign process on a new text-to-image model and measuring no improvement in OCR accuracy on generated text or a measurable drop in general image quality compared with the unaligned base model.
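A hedged sketch of how that check could be scored, assuming OCR transcripts are already available for both the base and the aligned model. The metric used here (one minus normalized edit distance) and the helper names are assumptions; the source does not say which OCR accuracy definition the paper uses.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(target, ocr_text):
    """1 - normalized edit distance, clipped to [0, 1] (assumed metric)."""
    if not target:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - levenshtein(target, ocr_text) / len(target))


def mean_gain(targets, base_ocr, aligned_ocr):
    """Mean accuracy of the aligned model minus mean accuracy of the base model."""
    base = [char_accuracy(t, o) for t, o in zip(targets, base_ocr)]
    aligned = [char_accuracy(t, o) for t, o in zip(targets, aligned_ocr)]
    return sum(aligned) / len(aligned) - sum(base) / len(base)
```

A `mean_gain` at or below zero on a new generator, or a positive gain accompanied by a drop on a separate general-quality metric, would count against the claim.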
Original abstract
Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TextAlign, a non-invasive post-training framework for improving text rendering in text-to-image models via preference alignment. It employs a hierarchical VLM-based reward that decomposes rendering errors at global, word, and glyph levels, converting binary defect judgments into scalar preference signals usable with GRPO or DPO. The authors report consistent OCR accuracy gains on FLUX.1-dev and Z-Image-Turbo without degraded general generation quality, and favorable comparisons against baselines such as SD3.5, Qwen-Image, AnyText, and TextDiffuser.
Significance. If the central empirical claims hold, the work demonstrates that reward design can serve as a scalable alternative to architecture-specific modifications for enhancing fine-grained text rendering in foundation models. This approach preserves model compatibility and could generalize across generators. The hierarchical decomposition targets the multi-scale nature of text errors, which is a strength if the VLM judgments prove reliable and stable.
major comments (2)
- [§3 (Method)] The central claim rests on the hierarchical VLM reward producing an accurate and stable preference signal. No validation of the VLM's binary defect judgments (e.g., inter-rater agreement with humans or error analysis at the glyph level) is provided, despite known limitations of VLMs on fine-grained visual reasoning; this directly affects whether the reported OCR gains on FLUX.1-dev and Z-Image-Turbo can be attributed to the method.
- [§4 (Experiments)] Experiments assert consistent OCR-based text accuracy gains without degrading quality and superiority to baselines, yet the available description provides no quantitative metrics, error bars, dataset sizes, ablation results, or statistical tests. This leaves the magnitude, reliability, and specificity of the improvements unverified.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., OCR accuracy delta) to support the empirical claims.
- [§3.2] Notation for the scalar preference signal derivation from binary judgments could be clarified with an explicit equation or pseudocode.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and describe the revisions that will be incorporated to strengthen the manuscript.
Point-by-point responses
- Referee: [§3 (Method)] The central claim rests on the hierarchical VLM reward producing an accurate and stable preference signal. No validation of the VLM's binary defect judgments (e.g., inter-rater agreement with humans or error analysis at the glyph level) is provided, despite known limitations of VLMs on fine-grained visual reasoning; this directly affects whether the reported OCR gains on FLUX.1-dev and Z-Image-Turbo can be attributed to the method.
  Authors: We agree that explicit validation of the VLM judgments would strengthen attribution of the OCR gains to the hierarchical reward. The current manuscript supports the reward's utility through consistent downstream OCR improvements and outperformance versus baselines, but we acknowledge this is indirect evidence. We will add a dedicated validation subsection with human inter-rater agreement (Cohen's kappa) on a sampled set of global/word/glyph judgments and a glyph-level error breakdown comparing VLM decisions to human annotations.
  Revision: yes
- Referee: [§4 (Experiments)] Experiments assert consistent OCR-based text accuracy gains without degrading quality and superiority to baselines, yet the available description provides no quantitative metrics, error bars, dataset sizes, ablation results, or statistical tests. This leaves the magnitude, reliability, and specificity of the improvements unverified.
  Authors: We appreciate the request for greater experimental transparency. The manuscript contains tables reporting OCR accuracy on FLUX.1-dev and Z-Image-Turbo together with baseline comparisons, but we agree the presentation can be expanded. In revision we will include per-experiment dataset sizes, standard deviations or error bars on OCR metrics, ablation results isolating each level of the hierarchical reward, and paired statistical tests (e.g., Wilcoxon signed-rank) to establish significance of the reported gains.
  Revision: yes
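For readers unfamiliar with the agreement statistic named in the rebuttal, the following is a minimal pure-Python sketch of Cohen's kappa between VLM and human binary defect labels. The function name, the example labels, and the convention of returning 1.0 when chance agreement is already total are assumptions made for illustration; nothing here comes from the paper's data.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal rate, summed over labels.
    labels = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(lab) / n) * (labels_b.count(lab) / n) for lab in labels
    )

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and returning 1.0 is a convention chosen here (assumption).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up example: 1 = defect flagged, 0 = no defect.
vlm_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
kappa = cohens_kappa(vlm_labels, human_labels)
```

In this toy example the raters agree on three of four items, chance agreement is 0.5, and kappa comes out at 0.5. Reporting kappa separately for global, word, and glyph judgments would show at which level the VLM is least reliable.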
Circularity Check
No circularity: reward signal derived from external VLM judgments
The paper describes TextAlign as a post-training preference alignment method that applies a hierarchical VLM-based reward to decompose text rendering errors at global, word, and glyph levels, converting binary defect judgments into a scalar preference signal for GRPO and DPO. No equations, derivations, or self-referential definitions appear in the abstract or described framework that reduce the claimed OCR accuracy gains to fitted parameters or tautological inputs by construction. The reward originates from external VLM evaluations rather than internal fits or self-citations that bear the central load, and experiments on FLUX.1-dev and Z-Image-Turbo are presented as empirical validation against baselines. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A vision-language model can accurately judge text rendering defects at global, word, and glyph levels and convert these into a useful preference signal.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Details of the Text Rendering Benchmark Construction
This section provides the full construction details summarized in Sec. 4.1. The pipeline is run independently pe...