StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting

Aleksandr Simonyan; Nipun Jindal

arxiv: 2605.17309 · v1 · pith:LQTGLFZKnew · submitted 2026-05-17 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 14:26 UTC · model grok-4.3

keywords: scene text inpainting · stylized text · dataset creation · image generation · OCR evaluation · generative AI · benchmarking

The pith

StyleText dataset of 28,518 triplets enables models to inpaint scene text while preserving style and improving legibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StyleText, a large-scale collection of image-mask-prompt triplets for training and testing models that fill in text within scene images without disrupting the surrounding visual style. The dataset is built by an automated pipeline: language models generate prompts from templates, Flux generates source images with key-value (KV) cache injection for consistency, optical character recognition filters samples for semantic quality, and polygon mask extraction plus mask-conditioned FluxFill augmentation prepare the training examples. Triplets are grouped into scene families so that methods can be compared under the same background conditions. The authors also define a standardized evaluation that uses normalized OCR metrics for text readability and CLIP image-image similarity for style matching. Fine-tuning a FluxFill+LoRA baseline on this data yields better text recognition than the same model at initialization, while keeping the scene look intact.

Core claim

StyleText contains 28,518 image-mask-prompt triplets grouped into 9,932 scene families. It is constructed with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency, establishing a strong reference point for future comparisons.
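To make the triplet and scene-family organization concrete, here is a minimal sketch of how such records could be represented and split. The field names (`image_path`, `mask_path`, `prompt`, `family_id`) and the family-level split are assumptions for illustration; the source does not describe the released file layout or how splits are drawn.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    # Hypothetical field names; the actual StyleText schema is not given in the source.
    image_path: str
    mask_path: str
    prompt: str
    family_id: str


def group_by_family(triplets):
    """Collect triplets that share a scene family (same background context)."""
    families = defaultdict(list)
    for t in triplets:
        families[t.family_id].append(t)
    return dict(families)


def split_by_family(families, held_out_ids):
    """Keep each family entirely on one side of the split (an assumed policy),
    so that no background appears in both train and test."""
    train, test = [], []
    for fid, members in families.items():
        (test if fid in held_out_ids else train).extend(members)
    return train, test
```

Splitting at the family level rather than the triplet level is one way to keep evaluation "controlled" in the sense the paper describes, since all variants of a scene are then judged together.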

What carries the argument

The automated pipeline for dataset construction using LLM prompt templating combined with Flux-based image generation, KV cache injection for consistency, OCR filtering, and mask-conditioned augmentation.

If this is right

Improved OCR accuracy on inpainted text becomes achievable with models trained on this dataset.
Scene style consistency can be maintained during text inpainting as shown by the baseline.
The evaluation protocol with normalized OCR metrics and CLIP similarity allows reproducible comparisons.
Controlled testing within scene families reduces variability in assessments of model performance.
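The normalized OCR metrics named in the protocol can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the normalization (lowercasing, stripping punctuation, collapsing whitespace) and the exact-match definition of word accuracy are guesses at what "normalized" means here, and the OCR engine that produces the hypothesis strings is left out.

```python
import re


def normalize(text):
    """Assumed normalization: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance between normalized strings, divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def word_accuracy(references, hypotheses):
    """Fraction of samples whose normalized OCR output matches the target exactly."""
    pairs = list(zip(references, hypotheses))
    if not pairs:
        return 0.0
    hits = sum(normalize(r) == normalize(h) for r, h in pairs)
    return hits / len(pairs)
```

For example, `character_error_rate("COFFEE", "C0FFEE")` counts one substitution over six reference characters. Note that this error rate can exceed 1.0 when the OCR output is much longer than the target.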

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future inpainting methods could build on this data to handle more complex scenes or different languages.
The scene family structure might help in studying how context affects text style matching.
Integrating this benchmark with other vision tasks like object detection could reveal broader benefits.

Load-bearing premise

The automated pipeline produces high-quality data without systematic biases or artifacts that affect model training and evaluation.

What would settle it

The claim would be weakened if retraining the FluxFill+LoRA baseline on StyleText failed to improve OCR word accuracy or character error rate over the model at initialization, or if style consistency dropped in independent tests.
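One way to check whether a before/after difference is more than noise is a bootstrap that resamples whole scene families, since triplets within a family share a background and are not independent. The sketch below is a generic percentile bootstrap, not a procedure from the paper; the function name and the input layout (a dict from family id to a non-empty list of per-sample scores) are assumptions.

```python
import random


def family_bootstrap_ci(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(after) - mean(before).

    before / after: dict mapping family_id -> non-empty list of per-sample
    scores (for example 1.0 for an exact OCR match, else 0.0). Both dicts are
    assumed to share the same family ids. Families are resampled with
    replacement, so within-family correlation is respected.
    """
    rng = random.Random(seed)
    fids = sorted(before)

    def mean_delta(ids):
        b = [s for f in ids for s in before[f]]
        a = [s for f in ids for s in after[f]]
        return sum(a) / len(a) - sum(b) / len(b)

    point = mean_delta(fids)
    deltas = sorted(
        mean_delta([rng.choice(fids) for _ in fids]) for _ in range(n_resamples)
    )
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lo, hi
```

An interval that excludes zero would support the claimed improvement on the sampled families; it says nothing about data drawn from outside the generation pipeline.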

Figures

Figures reproduced from arXiv: 2605.17309 by Aleksandr Simonyan, Nipun Jindal.

**Figure 2.** Scene-group consistency example. The same visual […]

**Figure 3.** Representative sample grid from StyleText, showing the diversity of visual backgrounds, typography styles, phrase lengths, and […]

**Figure 4.** End-to-end StyleText generation pipeline. (a) Prompted source generation with Flux and StableFlow-style KV injection produces […]

**Figure 5.** Qualitative comparison before (top) and after (bottom) […]

Original abstract

We present StyleText, a large-scale dataset and benchmark for localized scene-text inpainting with style preservation. StyleText contains 28,518 image-mask-prompt triplets grouped into 9,932 scene families, enabling controlled evaluation of text legibility and visual consistency under shared scene context. We construct the dataset with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value (KV) cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. We define a reproducible evaluation protocol using normalized OCR metrics (word accuracy and character error rate) and CLIP image-image similarity with explicit preprocessing. A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency, establishing a strong reference point for future comparisons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

StyleText gives a new dataset and pipeline for stylized scene text inpainting with scene-family grouping, but the baseline gains rest on an automated construction that needs more checks for bias and actual numbers.

The main point is that this paper ships a dataset of 28,518 triplets in 9,932 scene families for localized text inpainting that keeps scene style. The grouping by scene families is a practical step that lets future work test consistency under shared context rather than isolated images. Their automated pipeline combines LLM prompt templating, Flux generation with KV cache injection, OCR semantic filtering, polygon masks, and mask-conditioned FluxFill augmentation. That setup scales data creation without heavy manual labeling, which is the real engineering contribution here.

They also lay out a clear evaluation protocol using normalized OCR word accuracy and character error rate plus CLIP image-image similarity with preprocessing steps spelled out. A FluxFill plus LoRA baseline trained on the data reportedly lifts OCR accuracy while holding style, giving a reference point others can compare against. The construction details and protocol are the parts that look solid and worth building on.

The soft spots are in the results section. The abstract claims substantial OCR improvement but does not report the actual deltas, error bars, or ablation tables, so it is hard to judge how much of the gain comes from the data versus the model choice. The stress-test worry about Flux and OCR artifacts creating circular gains has some weight because the data generation and baseline both lean on the same model family; without external test sets or human validation of the triplets, it is possible the reported consistency is partly an artifact of the pipeline rather than proof of general inpainting progress. Minor issues include whether the filtering step drops hard stylized cases systematically.

This work is aimed at computer vision researchers who need training data or benchmarks for text editing and inpainting tools. A reader focused on scene text or image manipulation would find the dataset and protocol useful even if they retrain their own models. It deserves peer review because new large-scale resources with controlled splits are worth the time to vet, especially on the bias and reproducibility questions.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces StyleText, a dataset of 28,518 image-mask-prompt triplets grouped into 9,932 scene families for stylized scene text inpainting. It is constructed via an automated pipeline using LLM prompt templating, Flux-based source generation with KV cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. The authors define a reproducible evaluation protocol based on normalized OCR word accuracy, character error rate, and CLIP image-image similarity, and report that a FluxFill+LoRA baseline trained on the dataset substantially improves OCR accuracy over initialization while preserving scene style consistency.

Significance. If the dataset proves high-quality and free of systematic biases from the generation pipeline, this work would provide a valuable large-scale benchmark for localized text inpainting that preserves visual style, addressing a gap in existing scene-text datasets. The reproducible protocol and baseline reference point are positive contributions for future comparisons in computer vision.

major comments (1)

The central claim of dataset quality and unbiased evaluation rests on the automated pipeline (LLM templating + Flux generation + OCR filtering + FluxFill). The manuscript should include quantitative analysis showing that OCR filtering does not systematically drop hard stylized cases and that the resulting triplets do not correlate with Flux-family artifacts in a way that inflates the reported baseline OCR gains. Without such checks, the improvement may be partly circular rather than evidence of genuine inpainting progress.

minor comments (2)

The abstract states a 'substantial' OCR improvement but provides no numerical values, error bars, or ablation details; these should be added to the results section with explicit before/after metrics for the baseline.
Clarify the exact preprocessing steps and normalization procedure for the CLIP image-image similarity metric to ensure full reproducibility of the evaluation protocol.
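To illustrate what the CLIP image-image similarity metric reduces to once embeddings are available, here is a minimal sketch of cosine similarity with explicit L2 normalization. It assumes the embeddings have already been produced by some CLIP image encoder after whatever resizing and cropping the protocol specifies; the encoder, the preprocessing, and the function names below are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector has no direction to compare."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-norm embedding")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


def mean_style_similarity(pairs):
    """Average similarity over (source_embedding, inpainted_embedding) pairs."""
    sims = [cosine_similarity(a, b) for a, b in pairs]
    return sum(sims) / len(sims)
```

Because the score depends on the encoder variant and on image preprocessing, two implementations that differ in either can report different numbers for the same images, which is why the referee asks for those steps to be stated exactly.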

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of StyleText as a benchmark. We address the single major comment below and will revise the manuscript accordingly.

Referee: The central claim of dataset quality and unbiased evaluation rests on the automated pipeline (LLM templating + Flux generation + OCR filtering + FluxFill). The manuscript should include quantitative analysis showing that OCR filtering does not systematically drop hard stylized cases and that the resulting triplets do not correlate with Flux-family artifacts in a way that inflates the reported baseline OCR gains. Without such checks, the improvement may be partly circular rather than evidence of genuine inpainting progress.

Authors: We agree that additional quantitative checks are warranted to strengthen claims about dataset quality and to rule out circularity in the reported baseline gains. In the revised manuscript we will add a dedicated subsection under Dataset Construction that reports: (1) a comparison of stylization difficulty proxies (OCR confidence distribution, background entropy, and font-variation entropy) before versus after the OCR semantic filter, demonstrating that the filter does not disproportionately remove low-confidence or highly stylized examples; (2) a correlation analysis between per-triplet baseline OCR improvement and measurable Flux-family artifacts (e.g., via LPIPS distance to nearest training image and visual artifact classifiers), showing that gains are not driven by artifact correlation. These analyses will be performed on the released dataset splits and the code will be updated to reproduce them.

Revision: yes
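The two checks described in the response can be sketched in a few lines. Both functions below are hypothetical illustrations: `retention_by_bin` reports what share of samples survives a filter within each difficulty bin, and `pearson` measures linear correlation between, say, per-triplet OCR improvement and an artifact score. The difficulty proxy, the artifact score, and the bin edges are placeholders that would have to come from the actual pipeline.

```python
import math


def retention_by_bin(difficulty, kept, edges):
    """Share of samples kept by the filter in each difficulty bin.

    difficulty: per-sample proxy values (e.g. a stylization score);
    kept: per-sample booleans (True if the sample passed the filter);
    edges: ascending bin edges; the last bin includes its upper edge.
    Returns None for bins that contain no samples.
    """
    n_bins = len(edges) - 1
    totals = [0] * n_bins
    survivors = [0] * n_bins
    for d, k in zip(difficulty, kept):
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= d < edges[i + 1] or (is_last and d == edges[-1]):
                totals[i] += 1
                survivors[i] += int(k)
                break
    return [s / t if t else None for s, t in zip(survivors, totals)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

A retention rate that falls sharply in the highest-difficulty bins would suggest the filter is discarding hard stylized cases. A strong correlation between OCR gain and artifact score would lend weight to the circularity concern, though a weak linear correlation alone would not rule it out.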

Circularity Check

0 steps flagged

No significant circularity in dataset construction or baseline evaluation

The paper presents a dataset construction pipeline using external components (LLM templating, Flux generation with KV cache, OCR filtering, polygon masks, and FluxFill augmentation) and reports empirical results from training a FluxFill+LoRA baseline that improves OCR accuracy over initialization while preserving style consistency. No equations, fitted parameters, or self-citations are shown that reduce the reported improvements or evaluation metrics to inputs defined by the authors themselves. The evaluation protocol relies on standard normalized OCR and CLIP metrics with explicit preprocessing, making the central claims self-contained empirical observations rather than derivations that collapse by construction to the pipeline inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the multi-stage automated generation and filtering pipeline yields clean, style-consistent training data and that the reported baseline improvement reflects genuine generalization rather than pipeline artifacts.

axioms (1)

Domain assumption: LLM prompt templating combined with Flux generation and OCR filtering produces realistic and diverse stylized scene text examples suitable for training.
Invoked in: the dataset construction pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5668 in / 1278 out tokens · 43819 ms · 2026-05-20T14:26:29.532633+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We construct the dataset with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value (KV) cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

[1] Omer Bar-Tal, Yoni Shalev, Ron Mokady, Amir Hertz, Amit H. Bermano, and Tali Dekel. Text2live: Text-driven layered image and video editing. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.

[2] Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, and Kristian Kersting. LEDITS++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[3] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. TextDiffuser: Diffusion models as text painters. In Advances in Neural Information Processing Systems, 2023.

[4] OpenHermes Contributors. Openhermes-2.5-mistral-7b. https://huggingface.co/openhermes/OpenHermes-2.5-Mistral-7B, 2024.

[5] Katherine Crowson, Sheng-Yu Zhai, Tu Nguyen, and Jascha Sohl-Dickstein. Kv-edit: Editing images via dense key-value patching. arXiv preprint arXiv:2310.01850, 2023.

[6] Yuning Du, Yingying Xia, Shanjian Huang, Can Lin, Jiayi Yu, Yi Liu, Weining Zhou, Wei Xu, Xianwen Liu, Dacheng Liang, et al. Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint arXiv:2009.09941, 2020.

[7] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[8] Amir Hertz, Ron Mokady, Guy Tevet, Rinon Gal, Amit H. Bermano, Daniel Cohen-Or, and Tali Dekel. Prompt-to-prompt image editing with cross attention control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[9] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), pages 6840–6851, 2020.

[10] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017.

[11] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. ICDAR 2015 competition on robust reading. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160, 2015.

[12] Xinyue Li, Yichi Zhang, Menglin Yang, Yixuan Chen, Yixiao Zhang, Jason Baldridge, and Saurabh Singh. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18143–18153, 2023.

[13] Tom Lucas, Patrick von Platen, Clemens Meyer, Suraj Patil, Kashif Rasul, Lewis Tunstall, Sayak Paul, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Flux: Bridging transformers and diffusion models, 2024.

[14] Andreas Lugmayr, Martin Danelljan, and Radu Timofte. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[15] Haonan Ma, Weijie Lin, Yuwei Zhang, Yuwei Ye, Ruijia Gao, and Ziwei Liu. Stableflow: Progressive flow-guided scene generation from text. arXiv preprint arXiv:2311.16466, 2024.

[16] Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. DiffEditor: Boosting accuracy and flexibility on diffusion-based image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[17] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishra, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the International Conference on Machine Learning (ICML), pages 16784–16804, 2022.

[18] Alec Radford, Jong Wook Kim, Christopher Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision (CLIP). In Proceedings of the International Conference on Machine Learning (ICML), 2021.

[19] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.

[20] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8317–8326, 2019.

[21] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.

[22] Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Gao, and Enze Xie. AnyText: Multilingual visual text generation and editing. In Proceedings of the International Conference on Learning Representations (ICLR), 2024.

[23] Andreas Veit, Tomas Matera, Lukáš Neumann, Jiri Matas, and Serge Belongie. COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016.

[24] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

[25] Yiming Zhao and Zhouhui Lian. UDiffText: A unified framework for high-quality text synthesis in arbitrary images via character-aware diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
