FontFusion: Enhancing Generative Text in Diffusion Models with Typographic Conditioning

Ionut Mironica; Marian Lupascu; Nipun Jindal; Zhaowen Wang

arxiv: 2606.06066 · v1 · pith:EA2CO25Gnew · submitted 2026-06-04 · 💻 cs.CV · cs.GR

Pith reviewed 2026-06-28 02:04 UTC · model grok-4.3

classification 💻 cs.CV cs.GR

keywords diffusion models, typography generation, font conditioning, DiT, text generation, image synthesis, generative models

The pith

FontFusion resolves the font control versus legibility trade-off in diffusion models using dual encoders and three conditioning techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the problem where stronger font styling in AI image generators often makes the generated text unreadable. FontFusion adds a conditioning layer to Diffusion Transformer models that connects text content to font styles through multiple levels of tokens, ties those styles to specific image positions, and drops some tokens at different scales. A dual-encoder setup that pairs DeepFont with DINOv2 is shown to work better than any single encoder. The approach reports 76 percent relative gains on hard decorative fonts compared with single-encoder baselines and 68 to 76 percent better font consistency than models without any font conditioning. It plugs into existing DiT models without requiring retraining of the base network.
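As a rough illustration of what pairing a font-specific encoder with a general visual encoder could look like, the sketch below concatenates two unit-normalized vectors. Concatenation is an assumption made here for illustration: this page does not say how FontFusion combines DeepFont and DINOv2 features, and the function names, vector sizes, and values are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_font_embeddings(font_specific, general_visual):
    """Concatenate two unit-normalized embeddings into one conditioning vector.

    Assumption: normalize-then-concatenate is only one plausible fusion rule;
    it is not taken from the paper.
    """
    return l2_normalize(font_specific) + l2_normalize(general_visual)


# Toy stand-ins for the two encoder outputs (hypothetical values and sizes).
deepfont_like = [3.0, 4.0]
dinov2_like = [0.0, 2.0, 0.0]
fused = fuse_font_embeddings(deepfont_like, dinov2_like)
print(fused)  # [0.6, 0.8, 0.0, 1.0, 0.0]
```

Normalizing each part first keeps either encoder from dominating the fused vector purely because of its scale.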

Core claim

FontFusion is a plug-and-play conditioning framework for DiT architectures that resolves the trade-off between precise font control and text legibility through a hierarchical token representation establishing explicit text-font relationships at multiple granularities, position-aware embeddings creating spatial bindings between typography and image content, and a multi-level token dropping strategy improving efficiency and generalization to unseen fonts, with systematic evaluation showing that a dual encoder combining DeepFont and DINOv2 outperforms single encoders.

What carries the argument

Hierarchical token representation that establishes explicit text-font relationships at multiple granularities, augmented by position-aware embeddings and multi-level token dropping.
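A toy sketch can make the three pieces concrete: tokens at several granularities that each bind a text span to a font, a position index on every token, and a per-level drop probability. Everything here is hypothetical (the level names, the `FontToken` fields, the sample text, and the drop schedule); it is not the paper's implementation, only one way to read the description above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FontToken:
    level: str     # "line", "word", or "char" (hypothetical granularity names)
    text: str      # the text span this token covers
    font: str      # the font label bound to that span
    position: int  # index of the span within its level, standing in for a spatial position


def build_hierarchy(line, font):
    """Create one token for the whole line, one per word, and one per character."""
    tokens = [FontToken("line", line, font, 0)]
    tokens += [FontToken("word", w, font, i) for i, w in enumerate(line.split())]
    chars = [c for c in line if not c.isspace()]
    tokens += [FontToken("char", c, font, i) for i, c in enumerate(chars)]
    return tokens


def drop_tokens(tokens, drop_prob, rng):
    """Drop each token independently with a probability chosen by its level."""
    return [t for t in tokens if rng.random() >= drop_prob.get(t.level, 0.0)]


tokens = build_hierarchy("Grand Opening", "Carol Gothic")
# Hypothetical schedule: never drop the line token, drop finer levels more often.
schedule = {"line": 0.0, "word": 0.25, "char": 0.5}
kept = drop_tokens(tokens, schedule, random.Random(0))
print(len(tokens), "tokens built,", len(kept), "kept after dropping")
```

Keeping the coarsest token while thinning the finer ones is one way such a scheme could shorten the conditioning sequence without losing the text-font binding entirely.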

If this is right

FontFusion integrates directly into existing DiT architectures without retraining the base model.
A dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks.
The method produces 76 percent relative improvement on challenging decorative fonts over single-encoder baselines.
Font consistency improves by roughly 68 to 76 percent over unconditioned models.
Multi-level token dropping improves both computational efficiency and generalization to unseen fonts.
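Since the headline numbers are relative improvements, it helps to spell out the arithmetic. The sketch below assumes the usual definition, (improved - baseline) / |baseline|, with the sign flipped for metrics where lower is better. The scores are made up to show how a 76 percent figure can arise; they are not values from the paper.

```python
def relative_improvement(baseline, improved, higher_is_better=True):
    """Return the relative change of `improved` over `baseline` as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (improved - baseline) / abs(baseline)
    return change if higher_is_better else -change


# Hypothetical scores, not values from the paper.
print(round(relative_improvement(0.25, 0.44), 2))                           # 0.76
print(round(relative_improvement(0.50, 0.12, higher_is_better=False), 2))  # 0.76
```

Both calls print the same 76 percent even though the underlying scores differ, which is why a relative figure is hard to interpret without the absolute values and the metric definition.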

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hierarchical and position-aware conditioning pattern could be tested on non-DiT diffusion backbones to check transferability.
Design software could adopt the dual-encoder plus token hierarchy to let users specify exact fonts in generated images.
Further work might measure whether the token-dropping schedule also reduces training time when the method is used from scratch.
Extending the dual encoder with additional specialized font models might increase consistency on rare script styles.

Load-bearing premise

The measured gains arise from the dual-encoder choice and the three listed conditioning components rather than from dataset-specific tuning or metric selection that would not hold on other data.

What would settle it

Running the same models on a fresh collection of decorative fonts never seen during development and measuring whether the 76 percent relative improvement and legibility scores remain stable without any hyperparameter changes.

Figures

Figures reproduced from arXiv: 2606.06066 by Ionut Mironica, Marian Lupascu, Nipun Jindal, Zhaowen Wang.

**Figure 1.** FontFusion Architecture. Multi-modal encoding (text, image, font via DeepFont+DINOv2, identity) feeds into hierarchical token organization combining conditioning and VAE tokens. Position-aware embeddings and DiT processing with cosine attention scaling generate typographically controlled images. The modular design enables font conditioning during inference without base DiT modifications.

**Figure 2.** Qualitative comparison for two reference typefaces: Carol Gothic (decorative bold, left) and Parfumerie Script (italic script, right), over five diverse prompts. The first four rows demonstrate FontFusion’s plug-and-play capability directly on open-weight models: FLUX.1 Kontext [14] (labelled “Flux1 Kontext” in figure rows) and FLUX.1 [dev] [2] (labelled “Flux2” in figure rows) are each shown without and t…

Original abstract

Typography generation in diffusion models faces a persistent trade-off: enabling precise font control typically degrades text legibility, while maintaining readability often sacrifices typographic fidelity. We present FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that resolves this dilemma through three core innovations: (1) a hierarchical token representation establishing explicit text-font relationships at multiple granularities, (2) position-aware embeddings creating spatial bindings between typography and image content, and (3) a multi-level token dropping strategy improving both computational efficiency and generalization to unseen fonts. Our systematic evaluation of font embedding spaces reveals that a dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks. FontFusion demonstrates 76% relative improvement on challenging decorative fonts over single-encoder baselines and font consistency gains exceeding approximately 68-76% over unconditioned models, while integrating into existing DiT architectures without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FontFusion adds three specific conditioning components to DiT models for better font control and reports large gains, but the abstract supplies no experimental details to assess those numbers.

The main thing to know is that this paper introduces FontFusion as a plug-and-play conditioning method for Diffusion Transformers that tries to keep both font accuracy and text readability. It lists three concrete pieces: hierarchical tokens for text-font relations at different levels, position-aware embeddings to link typography to image content, and multi-level token dropping for speed and better handling of new fonts. It also finds that a dual encoder with DeepFont and DINOv2 beats single encoders on typography tasks.

The work does a straightforward job naming a practical tension that shows up when people try to generate images with styled text. The plug-and-play claim and the focus on unseen fonts are reasonable angles if the full experiments back them.

The soft spot is the evaluation. The abstract states 76% relative improvement on decorative fonts over single-encoder baselines and 68-76% consistency gains over unconditioned models, yet it gives no metric definitions, no baseline list, no dataset details, and no error bars. Without those, the size of the gains is hard to judge and could shift with different choices. The dual-encoder result is presented as systematic, but again the abstract does not show the protocol.
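One way the missing error bars could be supplied is a percentile bootstrap over per-sample scores. The sketch below assumes paired, strictly positive scores for a baseline and a method on the same samples; the function name, the resample count, and the data are illustrative and do not come from the paper.

```python
import random
import statistics


def bootstrap_relative_gain(baseline_scores, method_scores, n_resamples=2000, seed=0):
    """Percentile bootstrap 95% interval for the relative gain in mean score.

    Assumes the two lists are paired (same samples, same order) and that
    the baseline scores are positive, so the resampled mean is never zero.
    """
    if len(baseline_scores) != len(method_scores) or not baseline_scores:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(baseline_scores)
    gains = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        base_mean = statistics.fmean(baseline_scores[i] for i in idx)
        meth_mean = statistics.fmean(method_scores[i] for i in idx)
        gains.append((meth_mean - base_mean) / base_mean)
    gains.sort()
    return gains[int(0.025 * n_resamples)], gains[int(0.975 * n_resamples) - 1]


# Hypothetical per-sample scores, not data from the paper.
baseline = [0.20, 0.25, 0.30, 0.22, 0.28, 0.24, 0.26, 0.21]
method = [0.40, 0.43, 0.47, 0.41, 0.50, 0.39, 0.46, 0.44]
low, high = bootstrap_relative_gain(baseline, method)
print(f"relative gain, 95% interval: {low:.2f} to {high:.2f}")
```

Reporting an interval of this kind next to each headline percentage would show how much the gain could move with a different sample of fonts or prompts.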

This paper would mainly interest people working on text rendering inside generative image models, especially those building tools for design or creative workflows. If the full manuscript includes clear, reproducible comparisons and ablations, the ideas could be worth following.

I would send it to peer review. The framing and the listed components give enough structure that a referee could check whether the numbers hold once the experimental setup is visible.

Referee Report

1 major / 1 minor

Summary. The paper introduces FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures to resolve the trade-off between precise font control and text legibility in typography generation. It proposes three core innovations: (1) a hierarchical token representation for explicit text-font relationships at multiple granularities, (2) position-aware embeddings for spatial bindings between typography and image content, and (3) a multi-level token dropping strategy for efficiency and generalization. The paper reports that a dual encoder combining DeepFont and DINOv2 outperforms single encoders, with FontFusion achieving 76% relative improvement on challenging decorative fonts over single-encoder baselines and font consistency gains of approximately 68-76% over unconditioned models, while integrating into existing DiT architectures without retraining.

Significance. If the quantitative claims hold under rigorous, reproducible evaluation, FontFusion would address a practically important limitation in controllable text generation within diffusion models. The plug-and-play design and dual-encoder insight could enable broader adoption in DiT-based systems. The identification of multi-granularity conditioning as a route to balancing fidelity and legibility is a potentially useful direction, though its impact depends on whether the gains generalize beyond the reported settings.

major comments (1)

[Abstract] The central claims of '76% relative improvement on challenging decorative fonts over single-encoder baselines' and 'font consistency gains exceeding approximately 68-76% over unconditioned models' are presented without any experimental protocol, metric definitions (e.g., how consistency or legibility is quantified), baseline implementation details, dataset descriptions, or error bars. This prevents assessment of whether the data-to-claim link is sound and makes the reported gains impossible to verify or reproduce from the given information.

minor comments (1)

[Abstract] The phrasing 'exceeding approximately 68-76%' is imprecise and should be replaced with an exact reported range or per-metric values once the full evaluation section is available.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. The concern regarding the abstract is addressed below by clarifying that detailed experimental information is provided in the full manuscript, allowing verification of the reported results.

Referee: [Abstract] The central claims of '76% relative improvement on challenging decorative fonts over single-encoder baselines' and 'font consistency gains exceeding approximately 68-76% over unconditioned models' are presented without any experimental protocol, metric definitions (e.g., how consistency or legibility is quantified), baseline implementation details, dataset descriptions, or error bars. This prevents assessment of whether the data-to-claim link is sound and makes the reported gains impossible to verify or reproduce from the given information.

Authors: We agree that the abstract, as a concise summary, does not include full experimental protocols, metric definitions, baseline details, dataset descriptions, or error bars. These elements are provided in the manuscript's Experiments section (including quantitative evaluation protocols using the dual DeepFont+DINOv2 encoder, consistency metrics based on perceptual similarity and legibility scores, baseline DiT implementations, typography datasets, and reported standard deviations). The 76% relative improvement on decorative fonts and 68-76% consistency gains are derived from these systematic comparisons, enabling assessment of the data-to-claim link from the full paper rather than the abstract alone.

Revision: no

Circularity Check

0 steps flagged

No significant circularity

The provided abstract and context describe three conditioning innovations and dual-encoder evaluation results without equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations. Claims rest on reported empirical gains (e.g., 76% relative improvement) over baselines rather than any derivation that reduces to its own inputs by construction. The paper's chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical content remains at the level of high-level component names.

pith-pipeline@v0.9.1-grok · 5696 in / 1132 out tokens · 27294 ms · 2026-06-28T02:04:03.594999+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 23 canonical work pages · 3 internal anchors

[1] Berio, D., Leymarie, F.F., Asente, P., Echevarria, J.: Strokestyles: Stroke-based segmentation and stylization of fonts. ACM Trans. Graph. 41(3), 28:1–28:21 (2022). https://doi.org/10.1145/3505246

[2] Black Forest Labs: FLUX.1: A family of flow-based text-to-image models. https://blackforestlabs.ai/announcing-black-forest-labs/ (2024)

[3] Campbell, N.D.F., Kautz, J.: Learning a manifold of fonts. ACM Trans. Graph. 33(4), 91:1–91:11 (2014). https://doi.org/10.1145/2601097.2601212

[4] Chen, J., Huang, Y., Lv, T., Cui, L., Chen, Q., Wei, F.: Textdiffuser: Diffusion models as text painters. arXiv preprint arXiv:2305.10855 (2023)

[5] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Proceedings of the 41st International Conference on Machine Learning. ICML’24, JMLR.org (2024). https://doi...

[6] Feng, K., Zhang, Y., Yu, H., Ji, Z., Bai, J., Zhang, H., Zuo, W.: Vitaglyph: Vitalizing artistic typography with flexible dual-branch diffusion models. arXiv (2024). https://doi.org/10.48550/ARXIV.2410.01738

[7] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion (2022). https://doi.org/10.48550/ARXIV.2208.01618, https://arxiv.org/abs/2208.01618

[8] Google: Gemini 2.0 Flash (Nano Banana). https://deepmind.google/models/gemini-image/pro/ (2025)

[9] He, H., Chen, X., Wang, C., Liu, J., Du, B., Tao, D., Qiao, Y.: Diff-font: Diffusion model for robust one-shot font generation. Int. J. Comput. Vis. 132(11), 5372–5386 (2024). https://doi.org/10.1007/S11263-024-02137-0

[10] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020). https://doi.org/10.5555/3495724.3496298

[11] Ideogram: Ideogram 3.0. https://ideogram.ai (2025)

[12] Iluz, S., Vinker, Y., Hertz, A., Berio, D., Cohen-Or, D., Shamir, A.: Word-as-image for semantic typography. ACM Trans. Graph. 42(4), 151:1–151:11 (2023). https://doi.org/10.1145/3592123

[13] Imagen-Team-Google, Baldridge, J., Bauer, J., Bhutani, M., et al.: Imagen 3 (2024), https://arxiv.org/abs/2408.07009

[14] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025), https://arxiv.org/abs/2...

[15] Liu, Z., Liang, W., Liang, Z., Luo, C., Li, J., Huang, G., Yuan, Y.: Glyph-byt5: A customized text encoder for accurate visual text rendering. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXV. Lec...

[16] OpenAI: Addendum to GPT-4o System Card: Native image generation. https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/Native_Image_Generation_System_Card.pdf (2025)

[17] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P., Li, S., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features...

[18] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 4172–4182. IEEE (2023). https://doi.org/10.1109/ICCV51070.2023.00387

[19] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y. (eds.) International Conference on Learning Representations. vol. 2024, pp. 1862–1874 (2024). https://doi....

[20] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual E...

[21] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1–140:67 (2020), https://jmlr.org/papers/v21/20-074.html

[22] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv (2022). https://doi.org/10.48550/ARXIV.2204.06125

[23] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 10674–10685. IEEE (2022). https://doi.org/10.1109/CVPR52688.2022.01042

[24] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 22500–22510. IEEE (2023). https://doi.org/10.1109/CVPR52729.2023.02155

[25] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, S.K.S., Lopes, R.G., Ayan, B.K., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Pro...

[26] Shi, W., Song, Y., Zhang, D., Liu, J., Zou, X.: Fonts: Text rendering with typography and style controls. arXiv (2024). https://doi.org/10.48550/ARXIV.2412.00136

[27] Tanveer, M., Wang, Y., Mahdavi-Amiri, A., Zhang, H.: Ds-fusion: Artistic typography via discriminated and stylized diffusion. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 374–384. IEEE (2023). https://doi.org/10.1109/ICCV51070.2023.00041

[28] Tatsukawa, Y., Shen, I., Qi, A., Koyama, Y., Igarashi, T., Shamir, A.: Fontclip: A semantic typography visual-language model for multilingual font applications. Comput. Graph. Forum 43(2), i–iii (2024). https://doi.org/10.1111/CGF.15043

[29] Wang, Z., Yang, J., Jin, H., Shechtman, E., Agarwala, A., Brandt, J., Huang, T.S.: Deepfont: Identify your font from an image. In: Zhou, X., Smeaton, A.F., Tian, Q., Bulterman, D.C.A., Shen, H.T., Mayer-Patel, K., Yan, S. (eds.) Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM ’15, Brisbane, Australia, October 26 - 30, 2015. pp...

[30] Yang, Z., Peng, D., Kong, Y., Zhang, Y., Yao, C., Jin, L.: Fontdiffuser: One-shot font generation via denoising diffusion with multi-scale content aggregation and style contrastive learning. In: Wooldridge, M.J., Dy, J.G., Natarajan, S. (eds.) Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative App...

[31] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv (2023). https://doi.org/10.48550/ARXIV.2308.06721

[32] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 3813–3824. IEEE (2023). https://doi.org/10.1109/ICCV51070.2023.00355
