pith. machine review for the scientific record.

arxiv: 2603.07119 · v2 · submitted 2026-03-07 · 💻 cs.CV

Recognition: no theorem link

TIQA: Human-Aligned Perceptual Text Quality Assessment in Generated Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords text quality assessment · AI-generated images · perceptual metrics · text rendering · no-reference evaluation · human alignment · text-to-image models

The pith

Perceptual quality of text in AI-generated images can be scored separately from semantics using a dedicated no-reference model that reaches 0.94 correlation with human judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task, Text-in-Image Quality Assessment (TIQA), that isolates how well rendered text looks to humans from both overall image realism and semantic correctness. It releases two datasets of AI-generated images with human mean-opinion-score labels collected across multiple generators. The authors then introduce ANTIQA, a lightweight predictor built with text-specific biases, and show it aligns closely with human ratings on both cropped text regions and full images from unseen generators. When used to pick the best image out of five candidates, the model raises average text quality by 0.36 points on the human scale.
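As a concrete illustration of the best-of-five selection the paper evaluates, here is a minimal sketch; `generate_image` and `score_text_quality` are hypothetical stand-ins for a generator and for ANTIQA, whose actual API is not specified here.

```python
# Hypothetical generation-time selection: produce several candidates for one
# prompt and keep the one with the highest predicted text-quality score.
# Both callables are stand-ins, not the paper's released code.

def best_of_n(prompt, generate_image, score_text_quality, n=5):
    candidates = [generate_image(prompt) for _ in range(n)]
    scores = [score_text_quality(img) for img in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```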

Core claim

Perceptual text quality in generated images forms a distinct, measurable dimension that can be predicted without reference images or semantic understanding; ANTIQA achieves PLCC/SROCC of 0.942/0.935 on labeled text crops and 0.842/0.837 on full images from unseen generators, and raises text-quality MOS by 0.36 points when selecting among five outputs.

What carries the argument

ANTIQA, a lightweight neural predictor that applies text-specific inductive biases to detected text regions to produce a perceptual quality score independent of semantic content.
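The Figure 3 caption (in the Figures section below) describes the input construction: each detected text crop is converted to grayscale and concatenated with a Sobel edge map before being processed by the network. A minimal sketch of that preprocessing follows; the channel layout and normalization are assumptions here, not the paper's exact recipe.

```python
import numpy as np
from scipy import ndimage

def preprocess_crop(rgb_crop):
    """Build a 2-channel input from a detected text crop: grayscale plus a
    Sobel edge-magnitude map, following the Figure 3 description.
    `rgb_crop` is an H x W x 3 array with values in [0, 1]."""
    gray = rgb_crop @ np.array([0.299, 0.587, 0.114])   # luma grayscale
    gx = ndimage.sobel(gray, axis=1)                     # horizontal gradient
    gy = ndimage.sobel(gray, axis=0)                     # vertical gradient
    edges = np.hypot(gx, gy)
    edges = edges / (edges.max() + 1e-8)                 # scale edges to [0, 1] (assumption)
    return np.stack([gray, edges], axis=0)               # 2 x H x W input
```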

If this is right

  • Enables automatic ranking and filtering of generated images by text rendering quality without new human labels.
  • Supports generation-time selection that improves the text quality of the selected image by 14 percent (0.36 points) on the human MOS scale.
  • Provides a reproducible benchmark for comparing text rendering performance across current and future text-to-image models.
  • Separates visual typography defects from semantic errors so each can be optimized independently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The metric could be inserted into training objectives so that generators directly optimize for readable text rather than only global realism.
  • Similar region-specific perceptual predictors might be developed for other localized failure modes such as fine details in faces or hands.
  • Large-scale use of the scorer would let researchers track whether progress in overall image quality is accompanied by commensurate gains in text legibility.

Load-bearing premise

The mean-opinion scores from the 10k labeled crops and 1,500 full images remain stable and representative of human perception for text quality across future generators.

What would settle it

Test ANTIQA on a new text-to-image generator released after the datasets were collected and measure whether its PLCC on text-quality MOS falls below 0.75.
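PLCC and SROCC are the standard Pearson and Spearman correlations between predicted scores and human MOS. A minimal sketch of the proposed check, assuming hypothetical arrays of model scores and MOS labels for a post-dataset generator; note that some IQA protocols fit a nonlinear mapping before computing PLCC, and whether this paper does is not stated above.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_with_mos(predicted, mos):
    """PLCC (Pearson) and SROCC (Spearman) between model scores and human MOS."""
    predicted = np.asarray(predicted, dtype=float)
    mos = np.asarray(mos, dtype=float)
    return pearsonr(predicted, mos)[0], spearmanr(predicted, mos)[0]

# Hypothetical usage on a generator released after the datasets were collected:
# plcc, srocc = correlation_with_mos(antiqa_scores, human_tq_mos)
# holds_up = plcc >= 0.75   # the threshold proposed above
```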

Figures

Figures reproduced from arXiv: 2603.07119 by Aleksandr Gushchin, Anastasia Antsiferova, Dmitriy Vatolin, Kirill Koltsov.

Figure 1. Examples of text rendering artifacts in AI-generated images.
Figure 2. Overview of Text-in-Image Quality Assessment (TIQA). Left: AI-generated images contain multiple text regions that …
Figure 3. ANTIQA architecture. Each text crop is converted to grayscale, concatenated with a Sobel edge map, and then processed …
Figure 4. Box-plot distributions of OQ-MOS and TQ-MOS …
Figure 5. An example of gradual crop distortion, from left to …
Figure 6. Binned mean normalized Levenshtein similarity as a function of the score predicted by TIQA models.
Figure 7. Prompt and seed dependencies across text-to-image models on TIQA-Images. Colour encodes the mean TQ-MOS …
Figure 8. TIQA-Crops examples.
Figure 10. Distribution plot for OQ-MOS and TQ-MOS …
Figure 11. Representative visual examples of the rating categories …
Figure 12. TIQA-Images examples for overall quality.
Figure 13. TIQA-Images examples for text quality.
Original abstract

Recent text-to-image models have improved global realism, but text rendering remains a persistent failure mode: images may look convincing overall, yet local typography often contains malformed glyphs, broken strokes, irregular spacing, and other artifacts that humans heavily penalize. We formulate Text-in-Image Quality Assessment (TIQA), a no-reference task that estimates a human-aligned perceptual quality score for detected text regions while disentangling visual text quality from semantic correctness. To support this setting, we introduce two datasets. TIQA-Crops contains 120k text crops from 36k AI-generated images produced by 12 generators, with 10k mean-opinion-score (MOS) labels and 110k proxy labels for pretraining. TIQA-Images contains 1,500 text-heavy images from 10 recent generators, including proprietary systems, with paired overall-quality and text-quality subjective scores. We also propose ANTIQA, a lightweight predictor with text-specific inductive biases. Across crop-level and image-level evaluations, ANTIQA achieves the best alignment with human judgments, reaching PLCC/SROCC of 0.942/0.935 on TIQA-Crops and 0.842/0.837 for text-quality MOS on unseen generators in TIQA-Images. In best-of-5 AI-generated image ranking, ANTIQA improves the text quality of the selected image by 0.36 MOS (14%), demonstrating utility for benchmarking, filtering, and generation-time selection. Together, these findings establish perceptual text quality as a distinct evaluation target for modern text-to-image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces the TIQA no-reference task for estimating human-aligned perceptual quality scores of text regions in AI-generated images, disentangling visual quality from semantics. It releases two datasets—TIQA-Crops (120k crops from 36k images by 12 generators, with 10k MOS labels) and TIQA-Images (1,500 text-heavy images from 10 generators with paired scores)—and proposes the lightweight ANTIQA predictor with text-specific biases. ANTIQA reports PLCC/SROCC of 0.942/0.935 on TIQA-Crops and 0.842/0.837 on text-quality MOS for unseen generators in TIQA-Images, plus a 0.36 MOS (14%) improvement in best-of-5 ranking.

Significance. If the correlations and ranking gains hold under verification, the work supplies a practical perceptual metric for a known failure mode in text-to-image models. The datasets and ANTIQA could support benchmarking, filtering, and generation-time selection; the independent MOS collection avoids obvious circularity and the unseen-generator split provides a basic test of transfer.

major comments (3)
  1. [Abstract] The headline PLCC/SROCC figures (0.942/0.935 and 0.842/0.837) are given without error bars, standard deviations across folds, or statistical significance tests against baselines, so it is impossible to judge whether the claimed superiority is reliable or within noise.
  2. [Abstract] The central generalization claim, that ANTIQA works “across future generators”, rests on the untested assumption that the 10k MOS labels from only 12+10 generators already cover the space of human-perceptible text artifacts; no experiments probe novel glyph statistics, post-2024 diffusion artifacts, or new architectures that could shift the perceptual mapping.
  3. [Abstract] No ablation results are reported for the text-specific inductive biases in ANTIQA, so the contribution of each design choice to the reported correlations cannot be isolated, and the model’s claimed lightness and specificity remain unverified.
minor comments (1)
  1. [Abstract] The abstract should explicitly state whether the datasets will be publicly released, as this directly affects reproducibility of the 10k MOS labels and the reported numbers.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We will revise the manuscript to strengthen the statistical reporting, clarify the generalization claims, and add ablation studies as detailed below.

Point-by-point responses
  1. Referee: [Abstract] The headline PLCC/SROCC figures (0.942/0.935 and 0.842/0.837) are given without error bars, standard deviations across folds, or statistical significance tests against baselines, so it is impossible to judge whether the claimed superiority is reliable or within noise.

    Authors: We agree that error bars, standard deviations across folds, and statistical significance tests are necessary to substantiate the superiority claims. In the revised version, we will report mean PLCC/SROCC with standard deviations computed over multiple cross-validation folds and include pairwise statistical significance tests (e.g., Steiger's test or bootstrap confidence intervals; see the sketch after these responses) against the baselines. revision: yes

  2. Referee: [Abstract] The central generalization claim, that ANTIQA works “across future generators”, rests on the untested assumption that the 10k MOS labels from only 12+10 generators already cover the space of human-perceptible text artifacts; no experiments probe novel glyph statistics, post-2024 diffusion artifacts, or new architectures that could shift the perceptual mapping.

    Authors: The reported generalization is supported by the explicit unseen-generator split in TIQA-Images (10 generators held out from training), which demonstrates transfer to new generator families. We acknowledge that exhaustive coverage of all possible future artifacts is impossible and will revise the abstract wording to specify 'unseen generators' rather than 'future generators' while adding a limitations paragraph discussing potential shifts in perceptual mappings from novel architectures. revision: partial

  3. Referee: [Abstract] No ablation results are reported for the text-specific inductive biases in ANTIQA, so the contribution of each design choice to the reported correlations cannot be isolated, and the model’s claimed lightness and specificity remain unverified.

    Authors: We agree that ablations are required to isolate the impact of the text-specific inductive biases. The revised manuscript will include a dedicated ablation study removing or replacing each bias component (e.g., glyph-aware convolutions, spacing priors) and reporting the resulting drops in PLCC/SROCC on both TIQA-Crops and TIQA-Images. revision: yes
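The first response above proposes bootstrap confidence intervals against baselines; a minimal sketch of a paired, item-level bootstrap for the PLCC difference between two models follows, with all inputs hypothetical rather than results from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

def bootstrap_plcc_diff(scores_a, scores_b, mos, n_boot=10_000, seed=0):
    """Percentile bootstrap CI for PLCC(A) - PLCC(B) on the same test items.
    If the interval excludes 0, the gap is unlikely to be noise."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b, mos = (np.asarray(x, dtype=float)
                               for x in (scores_a, scores_b, mos))
    n = len(mos)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample items with replacement
        diffs[b] = (pearsonr(scores_a[idx], mos[idx])[0]
                    - pearsonr(scores_b[idx], mos[idx])[0])
    return np.percentile(diffs, [2.5, 97.5])
```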

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper collects independent human MOS labels on TIQA-Crops (10k labels) and TIQA-Images (1,500 images) from separate generators, then trains ANTIQA on those labels and reports correlation on held-out crops and unseen-generator images. No equations, self-citations, or fitted parameters are presented as independent predictions; the reported PLCC/SROCC values and MOS improvement are direct empirical measurements against external human judgments. The derivation remains self-contained against the collected benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that human MOS labels collected on the described crops and images constitute a reliable, generalizable ground truth for perceptual text quality; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption: Human mean-opinion scores collected on the 10k labeled crops and 1,500 images are stable and representative of general perceptual text quality.
    Invoked implicitly when the paper treats the collected MOS as the target for training and evaluation.

pith-pipeline@v0.9.0 · 5597 in / 1336 out tokens · 28845 ms · 2026-05-15T14:29:16.888785+00:00 · methodology

discussion (0)

