TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering

Alex Jinpeng Wang; Dongxing Mao; Linjie Li; Yilin Wang; Zhengyuan Yang

arxiv: 2604.24459 · v1 · submitted 2026-04-27 · 💻 cs.CV

Dongxing Mao, Yilin Wang, Linjie Li, Zhengyuan Yang, Alex Jinpeng Wang

Pith reviewed 2026-05-08 04:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords TextGround4M · text-to-image generation · layout-aware text rendering · prompt-aligned dataset · bounding box annotations · autoregressive models · text fidelity · spatial accuracy

The pith

A 4-million-pair dataset with span-level text and bounding-box annotations lets text-to-image models learn accurate prompt-grounded layouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TextGround4M, a dataset of over 4 million prompt-image pairs where each prompt is paired with span-level text content and corresponding bounding boxes. This resource supplies the fine-grained alignment missing from prior collections, especially for cases with multiple structured text elements. The authors show that appending layout-aware span tokens during training of autoregressive models improves text fidelity, spatial placement, and overall prompt consistency on a new stratified benchmark. They also define two layout-aware metrics to quantify spatial accuracy. If the approach holds, existing text-to-image systems can incorporate better text rendering through data and lightweight supervision rather than architectural overhaul.

Core claim

Training autoregressive text-to-image models on TextGround4M by appending layout-aware span tokens during training produces outputs with higher text fidelity, spatial accuracy, and prompt consistency than strong baselines, as measured in zero-shot evaluation on a benchmark stratified by layout complexity and using two new layout-aware metrics.

What carries the argument

Span-level annotations that link prompt text segments directly to image bounding boxes, combined with appended layout-aware span tokens that supply supervision only during training.
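
As a rough illustration of these two ingredients, the sketch below pairs prompt spans with bounding boxes and appends them to the training text as extra tokens. The record layout, the `<span>`/`<box>` token format, the choice to append to the prompt text, and the names `SpanAnnotation`, `build_training_text`, and `build_inference_text` are all assumptions made for illustration; the paper's actual token format is not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpanAnnotation:
    """One prompt-grounded text span and its box (hypothetical record layout)."""
    text: str                        # text expected to appear in the image
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels, assumed convention


def build_training_text(prompt: str, spans: List[SpanAnnotation]) -> str:
    """Append layout-aware span tokens after the prompt (training only)."""
    parts = [prompt]
    for span in spans:
        x0, y0, x1, y1 = span.box
        # Hypothetical token format; the real serialization may differ.
        parts.append(f"<span>{span.text}</span><box>{x0},{y0},{x1},{y1}</box>")
    return " ".join(parts)


def build_inference_text(prompt: str) -> str:
    """At inference the prompt is passed through unchanged, with no span tokens."""
    return prompt


spans = [
    SpanAnnotation("OPEN", (40, 30, 200, 90)),
    SpanAnnotation("24 hours", (60, 100, 180, 130)),
]
prompt = 'A shop sign that says "OPEN" above "24 hours"'
print(build_training_text(prompt, spans))
print(build_inference_text(prompt))
```

Keeping the inference path as an identity function mirrors the claim that supervision is supplied only during training.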

If this is right

Multi-span and structured text can be rendered with positions that match the prompt structure more closely.
Zero-shot evaluation of layout quality becomes possible through the stratified benchmark and the two new metrics.
The same lightweight token-append strategy can be applied to other autoregressive models without changing their inference behavior.
Prompt consistency improves because supervision is tied directly to prompt spans rather than global image features.
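
A benchmark stratified by layout complexity might be organized along the lines of the sketch below. Using the number of text spans as the complexity proxy, the three stratum names, and the thresholds are assumptions for illustration only; the paper's stratification criteria are not given here.

```python
from collections import defaultdict
from typing import Dict, List


def complexity_stratum(num_spans: int) -> str:
    """Map a span count to a stratum (thresholds are illustrative assumptions)."""
    if num_spans <= 1:
        return "single"
    if num_spans <= 3:
        return "few"
    return "many"


def stratify(samples: List[dict]) -> Dict[str, List[dict]]:
    """Group benchmark samples by the stratum of their span count."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for sample in samples:
        buckets[complexity_stratum(len(sample["spans"]))].append(sample)
    return dict(buckets)


def mean_score_by_stratum(samples: List[dict]) -> Dict[str, float]:
    """Average a per-sample score within each stratum."""
    return {
        name: sum(s["score"] for s in group) / len(group)
        for name, group in stratify(samples).items()
    }


# Toy samples with made-up scores, purely to exercise the functions.
samples = [
    {"spans": ["SALE"], "score": 0.9},
    {"spans": ["A", "B"], "score": 0.7},
    {"spans": ["A", "B", "C"], "score": 0.5},
    {"spans": ["A", "B", "C", "D", "E"], "score": 0.2},
]
print(mean_score_by_stratum(samples))
```

Reporting scores per stratum rather than as one pooled average keeps easy single-span cases from masking failures on multi-span layouts.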

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The dataset and token method could be adapted to video or 3D generation tasks where text must appear consistently across frames or views.
Public release of the annotations might allow researchers to test whether similar fine-grained grounding helps reduce hallucinations in other multimodal tasks.
If the span-token approach scales, future models might need fewer post-hoc text correction steps in production pipelines.

Load-bearing premise

The automatically or manually created span-level annotations accurately capture the intended text content and spatial layout from the original prompts.

What would settle it

Retraining the same baseline models on TextGround4M and observing no gain or a drop in text-fidelity and spatial-accuracy scores on the stratified benchmark would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.24459 by Alex Jinpeng Wang, Dongxing Mao, Linjie Li, Yilin Wang, Zhengyuan Yang.

**Figure 1.** Comparison of existing datasets and ours. Exist ...

**Figure 2.** Overview of the dataset construction pipeline for TextGround4M. We collect 11.7M image-text pairs from both public ...

**Figure 3.** Visualization of structural statistics in ...

**Figure 4.** Training and inference pipeline of our method. During training, prompt-grounded span tokens and bounding box ...

**Figure 5.** Qualitative comparison of generation results be ...

Original abstract

Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with correct spatial layout -- especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TextGround4M supplies a large new dataset with span-level text and layout annotations plus a token-based training trick, but the abstract gives no numbers to show whether it actually improves results.

The paper's core offering is TextGround4M, a 4-million-pair dataset where each prompt-image example comes with span-level text grounded to bounding boxes. It pairs this with a simple training change that appends layout tokens to autoregressive text-to-image models without touching the architecture or inference, plus a complexity-stratified benchmark and two new spatial metrics for evaluation. These pieces directly target the known weakness in current models where prompt-specified text ends up wrong or badly placed, especially in multi-span scenes. The scale and the decision to keep supervision lightweight are practical strengths that could let others build on the data without major engineering overhead. The stratified benchmark also looks like a useful step beyond flat test sets that mix easy and hard cases together.

The soft spot is the complete lack of quantitative evidence in the abstract. It states that trained models beat strong baselines on text fidelity, spatial accuracy, and prompt consistency, yet supplies no scores, no baseline details, no ablations, and no error analysis. That makes it impossible to judge effect size or whether the gains come from the annotations themselves. The stress-test concern about annotation quality lands: the paper says the spans are created automatically or manually, but reports nothing on agreement, precision against human gold, or noise levels. If those labels are systematically off, both the training signal and the benchmark become unreliable.

This is aimed at researchers working on grounded text-to-image generation who need better supervision data or evaluation tools. It deserves peer review because the dataset size and the training approach are concrete and reproducible in principle, even though the results section will need substantial expansion to be convincing.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TextGround4M, a dataset of over 4 million prompt-image pairs each annotated with span-level text and corresponding bounding boxes to enable fine-grained supervision for layout-aware text rendering in text-to-image models. It proposes a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens without altering architecture or inference behavior. The work also constructs a stratified benchmark for zero-shot evaluation of layout complexity and introduces two new layout-aware metrics, claiming that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency.

Significance. If the annotation quality is validated and the reported outperformance is supported by ablations and quantitative controls, this dataset and training approach would be a useful contribution to addressing persistent challenges in accurate text rendering and spatial layout in T2I generation. The non-intrusive nature of the span-token training method is a practical strength that could facilitate adoption across existing models. The stratified benchmark and new metrics help fill an evaluation gap, though the overall significance depends on demonstrating that gains stem from genuine layout grounding rather than dataset construction artifacts.

major comments (2)

[Abstract and §3 (Dataset Construction)] The central claim that models trained on TextGround4M outperform baselines in text fidelity, spatial accuracy, and prompt consistency depends on the span-level annotations faithfully reflecting intended prompt content and layout. The manuscript states annotations are 'automatically or manually created' but reports no inter-annotator agreement, precision-recall against human gold standards, or ablation removing noisy spans. Errors here would directly undermine both training supervision and the reliability of the stratified zero-shot benchmark, as the same source supplies ground-truth labels.
[§5 (Experiments)] The abstract asserts performance gains but the provided description supplies no quantitative numbers, baseline details, ablation studies, or error analysis. Without these (e.g., specific metrics on text fidelity or spatial accuracy, comparison tables, or controls for prompt complexity), it is impossible to assess whether improvements are attributable to the layout supervision or to unstated choices in data splits, metrics, or training.
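
The precision-recall check against a human gold standard requested in the first comment could look roughly like the sketch below. The matching rule (case-insensitive exact text plus an IoU threshold of 0.5, matched greedily) and the names `iou` and `span_precision_recall` are assumptions chosen for illustration, not a procedure taken from the manuscript.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1), assumed convention
Span = Tuple[str, Box]                    # (text, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def span_precision_recall(
    pred: List[Span], gold: List[Span], iou_thresh: float = 0.5
) -> Tuple[float, float]:
    """Greedy one-to-one matching of predicted spans against gold spans."""
    unmatched = list(gold)
    true_pos = 0
    for text, box in pred:
        for candidate in unmatched:
            same_text = candidate[0].strip().lower() == text.strip().lower()
            if same_text and iou(box, candidate[1]) >= iou_thresh:
                unmatched.remove(candidate)  # each gold span matches at most once
                true_pos += 1
                break
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy example: one span matches, one has a transcription error.
gold = [("OPEN", (0, 0, 100, 50)), ("24 hours", (0, 60, 100, 90))]
pred = [("open", (5, 0, 100, 50)), ("24 hrs", (0, 60, 100, 90))]
print(span_precision_recall(pred, gold))
```

Running such a check on a manually verified sample would give the noise estimate that both the training signal and the benchmark labels depend on.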

minor comments (2)

[§4 (Training Strategy)] The description of how span tokens are appended during training could include a concrete example or pseudocode to clarify the lightweight strategy and confirm it leaves inference unchanged.
[§5.2 (Metrics)] Clarify the exact definitions and formulas for the two new layout-aware metrics in the evaluation section to allow reproducibility.
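
To show the level of detail a reproducible metric definition needs, here is one hypothetical layout-aware score: the fraction of span pairs whose relative left/right and above/below ordering in the generated image agrees with the reference layout. This is not one of the paper's two metrics, whose definitions are not available here; the function name and the pairing convention (`ref[i]` and `gen[i]` describe the same span) are assumptions.

```python
from itertools import combinations
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1), assumed convention


def center(box: Box) -> Tuple[float, float]:
    """Center point of an axis-aligned box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def sign(value: float) -> int:
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def relative_order_agreement(ref: List[Box], gen: List[Box]) -> float:
    """Fraction of span pairs whose horizontal and vertical order both match."""
    if len(ref) != len(gen):
        raise ValueError("ref and gen must list the same spans in the same order")
    pairs = list(combinations(range(len(ref)), 2))
    if not pairs:
        return 1.0  # zero or one span: no pairwise order to violate
    agree = 0
    for i, j in pairs:
        (rxi, ryi), (rxj, ryj) = center(ref[i]), center(ref[j])
        (gxi, gyi), (gxj, gyj) = center(gen[i]), center(gen[j])
        same_x = sign(rxi - rxj) == sign(gxi - gxj)
        same_y = sign(ryi - ryj) == sign(gyi - gyj)
        agree += int(same_x and same_y)
    return agree / len(pairs)


reference = [(0, 0, 10, 10), (0, 20, 10, 30), (20, 0, 30, 10)]
print(relative_order_agreement(reference, reference))
```

A definition at this level fixes the box convention, the pairing rule, and the edge cases, which is what a reader would need to recompute the reported numbers.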

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments identify key areas where additional validation and detail will strengthen the presentation of our contributions. We address each major comment below and will incorporate revisions accordingly.

Referee: [Abstract and §3 (Dataset Construction)] The central claim that models trained on TextGround4M outperform baselines in text fidelity, spatial accuracy, and prompt consistency depends on the span-level annotations faithfully reflecting intended prompt content and layout. The manuscript states annotations are 'automatically or manually created' but reports no inter-annotator agreement, precision-recall against human gold standards, or ablation removing noisy spans. Errors here would directly undermine both training supervision and the reliability of the stratified zero-shot benchmark, as the same source supplies ground-truth labels.

Authors: We agree that the reliability of the span-level annotations is foundational to both the training approach and the benchmark. Section 3 describes a hybrid pipeline combining automated extraction with manual verification on sampled data, but we did not report quantitative validation such as inter-annotator agreement, precision-recall against gold standards, or an ablation on noisy spans. We will revise the manuscript to include these analyses, along with an ablation study measuring the effect of annotation noise on downstream performance. This will directly substantiate that the reported gains arise from faithful layout grounding rather than artifacts.

Revision: yes

Referee: [§5 (Experiments)] The abstract asserts performance gains but the provided description supplies no quantitative numbers, baseline details, ablation studies, or error analysis. Without these (e.g., specific metrics on text fidelity or spatial accuracy, comparison tables, or controls for prompt complexity), it is impossible to assess whether improvements are attributable to the layout supervision or to unstated choices in data splits, metrics, or training.

Authors: We acknowledge that the abstract summarizes outcomes without numerical detail and that the experimental section as presented lacks sufficient quantitative support for the claims. We will revise the manuscript to expand §5 with explicit metric values (text fidelity, spatial accuracy, prompt consistency), full baseline comparisons, ablation studies isolating the contribution of span tokens, and error analysis stratified by layout complexity. Key quantitative highlights will also be added to the abstract. These additions will enable readers to evaluate whether gains are attributable to the layout supervision.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset and benchmark construction with independent evaluation metrics.

The paper presents a new dataset (TextGround4M) with span-level annotations, a lightweight training strategy of appending span tokens, a stratified benchmark, and two new layout-aware metrics. No equations, derivations, or predictions appear in the abstract or described content. Performance claims rest on empirical comparisons of models trained on the new data versus baselines, evaluated zero-shot on the benchmark. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations are present; the work is self-contained as a data contribution without reducing claims to prior fitted quantities or author-specific ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, derivations, or new physical entities; the work is entirely empirical and data-driven with no free parameters, axioms, or invented entities required for the central claim.

pith-pipeline@v0.9.0 · 5529 in / 1108 out tokens · 37916 ms · 2026-05-08T04:23:07.070994+00:00 · methodology

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1] X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. CoRR, abs/2507.22058, 2025

Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning. Geng, Z.; Wang, Y.; Ma, Y.; Li, C.; Rao, Y.; Gu, S.; Zhong, Z.; Lu, Q.; Hu, H.; Zhang, X.; Linus; Wang, D.; and Jiang, J. 2025. X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great ...

[2] Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining, 2024a

Springer. Liu, D.; Zhao, S.; Zhuo, L.; Lin, W.; Qiao, Y.; Li, H.; and Gao, P. 2024. Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining. arXiv preprint arXiv:2408.02657. Ma, J.; Zhao, M.; Chen, C.; Wang, R.; Niu, D.; Lu, H.; and Lin, X. 2023. GlyphDraw: Seamlessly Rendering Text with Intricate ...

[3] Textatlas5m: A large-scale dataset for dense text image generation. arXiv preprint arXiv:2502.07870, 2025

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation. arXiv:2502.07870. Wu, C.; Chen, X.; Wu, Z.; Ma, Y.; Liu, X.; Pan, Z.; Liu, W.; Xie, Z.; Yu, X.; Ruan, C.; and Luo, P. 2024. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. arXiv:2410.13848. Xie, E.; Chen, J.; Zhao, Y.; Yu, J.; Zhu, L.; Wu, C.; ...
