pith. machine review for the scientific record.

arxiv: 2603.07119 · v2 · submitted 2026-03-07 · 💻 cs.CV

Recognition: no theorem link

TIQA: Human-Aligned Perceptual Text Quality Assessment in Generated Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords text quality assessment · AI-generated images · perceptual metrics · text rendering · no-reference evaluation · human alignment · text-to-image models

The pith

Perceptual quality of text in AI-generated images can be scored separately from semantics using a dedicated no-reference model that reaches 0.94 correlation with human judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task, Text-in-Image Quality Assessment (TIQA), that isolates how well rendered text looks to humans from both overall image realism and semantic correctness. It releases two datasets of AI-generated images with human mean-opinion-score labels collected across multiple generators. The authors then introduce ANTIQA, a lightweight predictor built with text-specific biases, and show it aligns closely with human ratings on both cropped text regions and full images from unseen generators. When used to pick the best image out of five candidates, the model raises average text quality by 0.36 points on the human scale.
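As a concrete illustration of the best-of-five selection the paper evaluates, here is a minimal sketch; `generate_image` and `score_text_quality` are hypothetical stand-ins for a generator and for ANTIQA, whose actual API is not specified here.

```python
# Hypothetical generation-time selection: produce several candidates for one
# prompt and keep the one with the highest predicted text-quality score.
# Both callables are stand-ins, not the paper's released code.

def best_of_n(prompt, generate_image, score_text_quality, n=5):
    candidates = [generate_image(prompt) for _ in range(n)]
    scores = [score_text_quality(img) for img in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```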

Core claim

Perceptual text quality in generated images forms a distinct, measurable dimension that can be predicted without reference images or semantic understanding; ANTIQA achieves PLCC/SROCC of 0.942/0.935 on labeled text crops and 0.842/0.837 on full images from unseen generators, and raises text-quality MOS by 0.36 points when selecting among five outputs.

What carries the argument

ANTIQA, a lightweight neural predictor that applies text-specific inductive biases to detected text regions to produce a perceptual quality score independent of semantic content.
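The Figure 3 caption (in the Figures section below) describes the input construction: each detected text crop is converted to grayscale and concatenated with a Sobel edge map before being processed by the network. A minimal sketch of that preprocessing follows; the channel layout and normalization are assumptions here, not the paper's exact recipe.

```python
import numpy as np
from scipy import ndimage

def preprocess_crop(rgb_crop):
    """Build a 2-channel input from a detected text crop: grayscale plus a
    Sobel edge-magnitude map, following the Figure 3 description.
    `rgb_crop` is an H x W x 3 array with values in [0, 1]."""
    gray = rgb_crop @ np.array([0.299, 0.587, 0.114])   # luma grayscale
    gx = ndimage.sobel(gray, axis=1)                     # horizontal gradient
    gy = ndimage.sobel(gray, axis=0)                     # vertical gradient
    edges = np.hypot(gx, gy)
    edges = edges / (edges.max() + 1e-8)                 # scale edges to [0, 1] (assumption)
    return np.stack([gray, edges], axis=0)               # 2 x H x W input
```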

If this is right

  • Enables automatic ranking and filtering of generated images by text rendering quality without new human labels.
  • Supports generation-time selection that improves the text quality of the selected image by 14 percent (0.36 points) on the human MOS scale.
  • Provides a reproducible benchmark for comparing text rendering performance across current and future text-to-image models.
  • Separates visual typography defects from semantic errors so each can be optimized independently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The metric could be inserted into training objectives so that generators directly optimize for readable text rather than only global realism.
  • Similar region-specific perceptual predictors might be developed for other localized failure modes such as fine details in faces or hands.
  • Large-scale use of the scorer would let researchers track whether progress in overall image quality is accompanied by commensurate gains in text legibility.

Load-bearing premise

The mean-opinion scores from the 10k labeled crops and 1,500 full images remain stable and representative of human perception for text quality across future generators.

What would settle it

Test ANTIQA on a new text-to-image generator released after the datasets were collected and measure whether its PLCC on text-quality MOS falls below 0.75.
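PLCC and SROCC are the standard Pearson and Spearman correlations between predicted scores and human MOS. A minimal sketch of the proposed check, assuming hypothetical arrays of model scores and MOS labels for a post-dataset generator; note that some IQA protocols fit a nonlinear mapping before computing PLCC, and whether this paper does is not stated above.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_with_mos(predicted, mos):
    """PLCC (Pearson) and SROCC (Spearman) between model scores and human MOS."""
    predicted = np.asarray(predicted, dtype=float)
    mos = np.asarray(mos, dtype=float)
    return pearsonr(predicted, mos)[0], spearmanr(predicted, mos)[0]

# Hypothetical usage on a generator released after the datasets were collected:
# plcc, srocc = correlation_with_mos(antiqa_scores, human_tq_mos)
# holds_up = plcc >= 0.75   # the threshold proposed above
```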

Figures

Figures reproduced from arXiv: 2603.07119 by Aleksandr Gushchin, Anastasia Antsiferova, Dmitriy Vatolin, Kirill Koltsov.

Figure 1. Examples of text rendering artifacts in AI-generated images.
Figure 2. Overview of Text-in-Image Quality Assessment (TIQA). Left: AI-generated images contain multiple text regions that …
Figure 3. ANTIQA architecture. Each text crop is converted to grayscale, concatenated with a Sobel edge map, and then processed …
Figure 4. Box-plot distributions of OQ-MOS and TQ-MOS …
Figure 5. An example of gradual crop distortion, from left to …
Figure 6. Binned mean normalized Levenshtein similarity as a function of the score predicted by TIQA models.
Figure 7. Prompt and seed dependencies across text-to-image models on TIQA-Images. Colour encodes the mean TQ-MOS …
Figure 8. TIQA-Crops examples.
Figure 10. Distribution plot for OQ-MOS and TQ-MOS …
Figure 11. Representative visual examples of the rating categories …
Figure 12. TIQA-Images examples for overall quality.
Figure 13. TIQA-Images examples for text quality.
Original abstract

Recent text-to-image models have improved global realism, but text rendering remains a persistent failure mode: images may look convincing overall, yet local typography often contains malformed glyphs, broken strokes, irregular spacing, and other artifacts that humans heavily penalize. We formulate Text-in-Image Quality Assessment (TIQA), a no-reference task that estimates a human-aligned perceptual quality score for detected text regions while disentangling visual text quality from semantic correctness. To support this setting, we introduce two datasets. TIQA-Crops contains 120k text crops from 36k AI-generated images produced by 12 generators, with 10k mean-opinion-score (MOS) labels and 110k proxy labels for pretraining. TIQA-Images contains 1,500 text-heavy images from 10 recent generators, including proprietary systems, with paired overall-quality and text-quality subjective scores. We also propose ANTIQA, a lightweight predictor with text-specific inductive biases. Across crop-level and image-level evaluations, ANTIQA achieves the best alignment with human judgments, reaching PLCC/SROCC of 0.942/0.935 on TIQA-Crops and 0.842/0.837 for text-quality MOS on unseen generators in TIQA-Images. In best-of-5 AI-generated image ranking, ANTIQA improves the text quality of the selected image by 0.36 MOS (14%), demonstrating utility for benchmarking, filtering, and generation-time selection. Together, these findings establish perceptual text quality as a distinct evaluation target for modern text-to-image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces the TIQA no-reference task for estimating human-aligned perceptual quality scores of text regions in AI-generated images, disentangling visual quality from semantics. It releases two datasets—TIQA-Crops (120k crops from 36k images by 12 generators, with 10k MOS labels) and TIQA-Images (1,500 text-heavy images from 10 generators with paired scores)—and proposes the lightweight ANTIQA predictor with text-specific biases. ANTIQA reports PLCC/SROCC of 0.942/0.935 on TIQA-Crops and 0.842/0.837 on text-quality MOS for unseen generators in TIQA-Images, plus a 0.36 MOS (14%) improvement in best-of-5 ranking.

Significance. If the correlations and ranking gains hold under verification, the work supplies a practical perceptual metric for a known failure mode in text-to-image models. The datasets and ANTIQA could support benchmarking, filtering, and generation-time selection; the independent MOS collection avoids obvious circularity and the unseen-generator split provides a basic test of transfer.

major comments (3)
  1. [Abstract] The headline PLCC/SROCC figures (0.942/0.935 and 0.842/0.837) are given without error bars, standard deviations across folds, or statistical significance tests against baselines, so it is impossible to judge whether the claimed superiority is reliable or within noise.
  2. [Abstract] The central generalization claim, that ANTIQA works “across future generators”, rests on the untested assumption that the 10k MOS labels from only 12+10 generators already cover the space of human-perceptible text artifacts; no experiments probe novel glyph statistics, post-2024 diffusion artifacts, or new architectures that could shift the perceptual mapping.
  3. [Abstract] No ablation results are reported for the text-specific inductive biases in ANTIQA, so the contribution of each design choice to the reported correlations cannot be isolated, and the model’s claimed lightness and specificity remain unverified.
minor comments (1)
  1. [Abstract] The abstract should explicitly state whether the datasets will be publicly released, as this directly affects reproducibility of the 10k MOS labels and the reported numbers.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We will revise the manuscript to strengthen the statistical reporting, clarify the generalization claims, and add ablation studies as detailed below.

Point-by-point responses
  1. Referee: [Abstract] The headline PLCC/SROCC figures (0.942/0.935 and 0.842/0.837) are given without error bars, standard deviations across folds, or statistical significance tests against baselines, so it is impossible to judge whether the claimed superiority is reliable or within noise.

    Authors: We agree that error bars, standard deviations across folds, and statistical significance tests are necessary to substantiate the superiority claims. In the revised version, we will report mean PLCC/SROCC with standard deviations computed over multiple cross-validation folds and include pairwise statistical significance tests (e.g., Steiger's test or bootstrap confidence intervals; see the sketch after these responses) against the baselines. revision: yes

  2. Referee: [Abstract] The central generalization claim, that ANTIQA works “across future generators”, rests on the untested assumption that the 10k MOS labels from only 12+10 generators already cover the space of human-perceptible text artifacts; no experiments probe novel glyph statistics, post-2024 diffusion artifacts, or new architectures that could shift the perceptual mapping.

    Authors: The reported generalization is supported by the explicit unseen-generator split in TIQA-Images (10 generators held out from training), which demonstrates transfer to new generator families. We acknowledge that exhaustive coverage of all possible future artifacts is impossible and will revise the abstract wording to specify 'unseen generators' rather than 'future generators' while adding a limitations paragraph discussing potential shifts in perceptual mappings from novel architectures. revision: partial

  3. Referee: [Abstract] No ablation results are reported for the text-specific inductive biases in ANTIQA, so the contribution of each design choice to the reported correlations cannot be isolated, and the model’s claimed lightness and specificity remain unverified.

    Authors: We agree that ablations are required to isolate the impact of the text-specific inductive biases. The revised manuscript will include a dedicated ablation study removing or replacing each bias component (e.g., glyph-aware convolutions, spacing priors) and reporting the resulting drops in PLCC/SROCC on both TIQA-Crops and TIQA-Images. revision: yes
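The first response above proposes bootstrap confidence intervals against baselines; a minimal sketch of a paired, item-level bootstrap for the PLCC difference between two models follows, with all inputs hypothetical rather than results from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

def bootstrap_plcc_diff(scores_a, scores_b, mos, n_boot=10_000, seed=0):
    """Percentile bootstrap CI for PLCC(A) - PLCC(B) on the same test items.
    If the interval excludes 0, the gap is unlikely to be noise."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b, mos = (np.asarray(x, dtype=float)
                               for x in (scores_a, scores_b, mos))
    n = len(mos)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample items with replacement
        diffs[b] = (pearsonr(scores_a[idx], mos[idx])[0]
                    - pearsonr(scores_b[idx], mos[idx])[0])
    return np.percentile(diffs, [2.5, 97.5])
```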

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper collects independent human MOS labels on TIQA-Crops (10k labels) and TIQA-Images (1,500 images) from separate generators, then trains ANTIQA on those labels and reports correlation on held-out crops and unseen-generator images. No equations, self-citations, or fitted parameters are presented as independent predictions; the reported PLCC/SROCC values and MOS improvement are direct empirical measurements against external human judgments. The derivation remains self-contained against the collected benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that human MOS labels collected on the described crops and images constitute a reliable, generalizable ground truth for perceptual text quality; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption: Human mean-opinion scores collected on the 10k labeled crops and 1,500 images are stable and representative of general perceptual text quality.
    Invoked implicitly when the paper treats the collected MOS as the target for training and evaluation.

pith-pipeline@v0.9.0 · 5597 in / 1336 out tokens · 28845 ms · 2026-05-15T14:29:16.888785+00:00 · methodology

discussion (0)

