TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

Fajri Koto; Fengxian Ji; Jiaming Wang; Jingpu Yang; Mingxuan Cui; Qian Jiang; Xiuying Chen; Zhecheng Shi; Zirui Song

arXiv: 2605.19320 · v2 · pith:NKMPEODN · submitted 2026-05-19 · 💻 cs.CV · cs.DB

Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi, Jiaming Wang, Zirui Song, Fajri Koto, Xiuying Chen

Pith reviewed 2026-05-20 06:55 UTC · model grok-4.3

classification 💻 cs.CV cs.DB

keywords: text rendering · preference alignment · text-to-image generation · hierarchical reward · vision-language model · OCR accuracy · post-training optimization · DPO and GRPO

The pith

Text rendering in image generators improves by aligning preferences with a hierarchical VLM reward that judges errors at global, word, and glyph levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that text rendering can be improved in existing text-to-image models by framing it as a post-training preference alignment task instead of modifying the model architecture. It introduces a hierarchical vision-language model reward that breaks rendering defects into global layout, word, and individual glyph levels and turns those judgments into a scalar signal for optimization. This signal works with both Group Relative Policy Optimization and Direct Preference Optimization. Experiments on FLUX.1-dev and Z-Image-Turbo demonstrate higher OCR accuracy on rendered text while keeping overall image quality intact. A sympathetic reader cares because the method offers a way to fix a common failure mode without redesigning large foundation models.

Core claim

Text rendering is studied as a post-training preference-alignment problem. A hierarchical VLM-based reward decomposes rendering errors into global, word, and glyph levels, converts binary defect judgments into a scalar preference signal, and supports both GRPO and DPO. This produces consistent gains in OCR-based text accuracy on FLUX.1-dev and Z-Image-Turbo without degrading general generation quality, outperforming baselines such as SD3.5, Qwen-Image, AnyText, and TextDiffuser.
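The conversion from binary defect judgments to a scalar signal can be sketched as follows. This page does not reproduce the paper's exact formula, so the aggregation rule here (the fraction of successfully parsed indicators that report no defect) is an assumption, as are the function name `hierarchical_reward`, the fallback value for unparsable output, and the toy judgments.

```python
from typing import Optional, Sequence


def hierarchical_reward(indicators: Sequence[Optional[bool]]) -> float:
    """Aggregate binary defect indicators into a scalar reward in [0, 1].

    Each entry is True (defect found), False (no defect), or None
    (the VLM answer could not be parsed). Assumed rule, not the paper's:
    reward = share of parsed indicators that report no defect.
    """
    parsed = [d for d in indicators if d is not None]
    if not parsed:
        # Assumption: with no usable judgment, fall back to the lowest reward.
        return 0.0
    return sum(1 for d in parsed if not d) / len(parsed)


# Hypothetical judgments for one generated image, grouped by level.
global_level = [False, False]        # e.g. layout and completeness checks
word_level = [True, False, None]     # one defect, one clean, one unparsed
glyph_level = [False, False, False]  # no glyph-level defects

reward = hierarchical_reward(global_level + word_level + glyph_level)
print(round(reward, 3))
```

Pooling all indicators with equal weight is the simplest choice; a per-level weighting would be an equally plausible reading of "hierarchical" and would only change the final averaging step.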

What carries the argument

A hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal.

If this is right

OCR accuracy on text in generated images rises consistently on the tested foundation models.
General image generation quality is preserved.
The same reward signal works for both GRPO and DPO optimization.
The approach compares favorably to existing text-rendering methods that require architecture changes.
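To illustrate how one scalar reward can feed both optimizers, here is a minimal sketch under stated assumptions. GRPO standardizes rewards within a group of samples drawn for the same prompt; for DPO, the sketch simply takes the best and worst sample of that group as the chosen/rejected pair. The pairing rule, the function names, and the reward values are hypothetical; the page does not say how the paper builds its preference pairs.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def best_worst_pair(rewards: List[float]) -> Tuple[int, int]:
    """Assumed DPO pairing: indices of the (chosen, rejected) samples."""
    chosen = max(range(len(rewards)), key=rewards.__getitem__)
    rejected = min(range(len(rewards)), key=rewards.__getitem__)
    return chosen, rejected


# Hypothetical rewards for four images generated from the same prompt.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
pair = best_worst_pair(rewards)
print(advantages, pair)
```

Because both routines consume only the scalar reward, the reward model itself needs no change when switching between the two training recipes.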

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The hierarchical reward design could transfer to other fine-grained control tasks such as layout or style in generative models.
Multi-scale error decomposition might help in related domains like video generation where text elements must remain legible across frames.
Reward modeling focused on specific capabilities may reduce the need for full model retraining when scaling foundation systems.

Load-bearing premise

The hierarchical VLM-based reward accurately decomposes and judges rendering errors at global, word, and glyph levels to produce a reliable scalar preference signal that improves the generator.

What would settle it

Running the TextAlign process on a new text-to-image model and measuring no improvement in OCR accuracy on generated text or a measurable drop in general image quality compared with the unaligned base model.
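Such a check needs a concrete OCR text metric. The sketch below shows two common ones, a normalized edit similarity and a strict exact match, computed between an OCR transcript and the reference text. These definitions are assumptions for illustration; the paper's precise metric definitions are not given on this page, and the OCR step itself is left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_similarity(pred: str, ref: str) -> float:
    """1 - edit distance / length of the longer string; 1.0 means identical."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def exact_match(pred: str, ref: str) -> bool:
    """Strict accuracy for one sample: transcripts must match after trimming."""
    return pred.strip() == ref.strip()


# Hypothetical OCR transcript versus the text requested in the prompt.
print(normalized_edit_similarity("OPFN", "OPEN"), exact_match("OPFN", "OPEN"))
```

Averaging either score over the same prompts for the base and the aligned model gives the comparison described above.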

Figures

Figures reproduced from arXiv: 2605.19320 by Fajri Koto, Fengxian Ji, Jiaming Wang, Jingpu Yang, Mingxuan Cui, Qian Jiang, Xiuying Chen, Zhecheng Shi, Zirui Song.

**Figure 1.** Text rendering results. Representative 720 × 720 samples generated by our aligned models. TextAlign renders legible and well-formed visual text across diverse carriers, styles, layouts, and text lengths while preserving coherent image content.

**Figure 2.** Our hierarchical reward mechanism. Given a generated image x and reference text y, three independent VLM calls produce binary indicators at the global, word and glyph levels, which are aggregated into a scalar reward R that drives either GRPO or DPO.

**Figure 3.** User study. Human preference votes on text fidelity and visual integration. Our GRPO-aligned models outperform prior baselines and base generators on both criteria, with Z-Image (Our GRPO) preferred most.

**Figure 4.** Qualitative comparison of text rendering results. Given the same prompts, GRPO-aligned FLUX and Z-Image produce more faithful and legible visual text while preserving the surrounding visual context.

**Figure 5.** Robustness to text length and spatial placement. Radar visualizations of FLUX (Our GRPO) and Z-Image-Turbo (Our GRPO) across text-length and position subsets.

**Figure 6.** Qualitative results across visual categories. Z-Image (Our GRPO) renders legible text across diverse visual text scenarios while preserving category-specific style and layout.

Original abstract

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
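For reference, the standard DPO objective can be written for this setting as below. This is the generic form from the DPO literature, not a formula taken from the paper: the symbols c (prompt), x^w and x^l (preferred and dispreferred images), β (temperature), and π_ref (frozen base generator) are notational assumptions. For diffusion or flow generators the image likelihoods are usually intractable and are replaced by a denoising-loss surrogate, so the objective actually trained in the paper may differ.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(c,\,x^{w},\,x^{l})}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(x^{w} \mid c)}{\pi_{\mathrm{ref}}(x^{w} \mid c)}
      - \beta \log \frac{\pi_\theta(x^{l} \mid c)}{\pi_{\mathrm{ref}}(x^{l} \mid c)}
    \right) \right],
\qquad R(x^{w}, y) > R(x^{l}, y)
```

The side condition states the assumed link to the reward: within a pair, the image with the higher scalar reward R for reference text y plays the role of the preferred sample.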

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TextAlign reframes text rendering as post-training alignment with a hierarchical VLM reward that breaks errors into global/word/glyph levels, but the gains rest on unverified VLM judgment quality.

The main takeaway is that this paper treats better text rendering in text-to-image models as a preference alignment problem rather than an architecture redesign. The authors keep the generator fixed and instead build a hierarchical VLM reward that scores errors at global, word, and glyph scales, converts those scores into a scalar signal, and feeds it into GRPO or DPO. The claim is that this produces measurable OCR gains on FLUX.1-dev and Z-Image-Turbo without hurting general image quality, and that it beats several baselines including AnyText and TextDiffuser.

Referee Report

2 major / 2 minor

Summary. The paper introduces TextAlign, a non-invasive post-training framework for improving text rendering in text-to-image models via preference alignment. It employs a hierarchical VLM-based reward that decomposes rendering errors at global, word, and glyph levels, converting binary defect judgments into scalar preference signals usable with GRPO or DPO. Experiments on FLUX.1-dev and Z-Image-Turbo claim consistent OCR accuracy gains without degrading general generation quality, outperforming baselines such as SD3.5, Qwen-Image, AnyText, and TextDiffuser.

Significance. If the central empirical claims hold, the work demonstrates that reward design can serve as a scalable alternative to architecture-specific modifications for enhancing fine-grained text rendering in foundation models. This approach preserves model compatibility and could generalize across generators. The hierarchical decomposition targets the multi-scale nature of text errors, which is a strength if the VLM judgments prove reliable and stable.

major comments (2)

[§3 (Method)] The central claim rests on the hierarchical VLM reward producing an accurate and stable preference signal. No validation of the VLM's binary defect judgments (e.g., inter-rater agreement with humans or error analysis at the glyph level) is provided, despite known limitations of VLMs on fine-grained visual reasoning; this directly affects whether the reported OCR gains on FLUX.1-dev and Z-Image-Turbo can be attributed to the method.
[§4 (Experiments)] Experiments assert consistent OCR-based text accuracy gains without degrading quality and superiority to baselines, yet the available description provides no quantitative metrics, error bars, dataset sizes, ablation results, or statistical tests. This leaves the magnitude, reliability, and specificity of the improvements unverified.
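The inter-rater agreement requested in the first major comment is commonly measured with Cohen's kappa. The sketch below computes it for binary defect labels from two raters, for example the VLM and one human annotator. The labels are made-up toy data and the function name is hypothetical; nothing here reflects measurements from the paper.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters giving binary labels (1 = defect, 0 = clean)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal rate of saying "defect".
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels on eight glyph-level judgments (illustrative only).
vlm_labels = [1, 1, 0, 0, 1, 0, 0, 0]
human_labels = [1, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(vlm_labels, human_labels)
print(round(kappa, 3))
```

Kappa corrects raw agreement for chance, which matters here because defect-free judgments are likely to dominate and inflate simple percent agreement.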

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., OCR accuracy delta) to support the empirical claims.
[§3.2] Notation for the scalar preference signal derivation from binary judgments could be clarified with an explicit equation or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions that will be incorporated to strengthen the manuscript.

Referee: [§3 (Method)] The central claim rests on the hierarchical VLM reward producing an accurate and stable preference signal. No validation of the VLM's binary defect judgments (e.g., inter-rater agreement with humans or error analysis at the glyph level) is provided, despite known limitations of VLMs on fine-grained visual reasoning; this directly affects whether the reported OCR gains on FLUX.1-dev and Z-Image-Turbo can be attributed to the method.

Authors: We agree that explicit validation of the VLM judgments would strengthen attribution of the OCR gains to the hierarchical reward. The current manuscript supports the reward's utility through consistent downstream OCR improvements and outperformance versus baselines, but we acknowledge this is indirect evidence. We will add a dedicated validation subsection with human inter-rater agreement (Cohen's kappa) on a sampled set of global/word/glyph judgments and a glyph-level error breakdown comparing VLM decisions to human annotations.
Revision: yes

Referee: [§4 (Experiments)] Experiments assert consistent OCR-based text accuracy gains without degrading quality and superiority to baselines, yet the available description provides no quantitative metrics, error bars, dataset sizes, ablation results, or statistical tests. This leaves the magnitude, reliability, and specificity of the improvements unverified.

Authors: We appreciate the request for greater experimental transparency. The manuscript contains tables reporting OCR accuracy on FLUX.1-dev and Z-Image-Turbo together with baseline comparisons, but we agree the presentation can be expanded. In revision we will include per-experiment dataset sizes, standard deviations or error bars on OCR metrics, ablation results isolating each level of the hierarchical reward, and paired statistical tests (e.g., Wilcoxon signed-rank) to establish significance of the reported gains.
Revision: yes
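A paired Wilcoxon signed-rank test of the kind mentioned in the second response can be sketched in plain Python as below. It drops zero differences, gives tied magnitudes their average rank, and uses the normal approximation without tie or continuity correction, which is crude for small samples; a statistics library such as SciPy (`scipy.stats.wilcoxon`) would be the usual choice in practice. The per-prompt counts are invented toy values, not results from the paper.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, two-sided p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


# Toy data: correctly rendered words (out of 10) per prompt, illustrative only.
aligned = [9, 8, 7, 10, 6, 9, 8]
base = [7, 7, 7, 8, 7, 5, 5]
w_plus, p_value = wilcoxon_signed_rank(aligned, base)
print(w_plus, round(p_value, 3))
```

The test is paired because both models are scored on the same prompts, so per-prompt difficulty cancels out in the differences.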

Circularity Check

0 steps flagged

No circularity: reward signal derived from external VLM judgments

The paper describes TextAlign as a post-training preference alignment method that applies a hierarchical VLM-based reward to decompose text rendering errors at global, word, and glyph levels, converting binary defect judgments into a scalar preference signal for GRPO and DPO. No equations, derivations, or self-referential definitions appear in the abstract or described framework that reduce the claimed OCR accuracy gains to fitted parameters or tautological inputs by construction. The reward originates from external VLM evaluations rather than internal fits or self-citations that bear the central load, and experiments on FLUX.1-dev and Z-Image-Turbo are presented as empirical validation against baselines. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a VLM can reliably produce multi-level binary defect judgments convertible to scalar preferences. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: A vision-language model can accurately judge text rendering defects at global, word, and glyph levels and convert these into a useful preference signal. This underpins the entire reward model and subsequent alignment training described in the abstract.

pith-pipeline@v0.9.0 · 5742 in / 1216 out tokens · 27511 ms · 2026-05-20T06:55:41.783937+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.