ANCHOR: LLM-driven Subject Conditioning for Text-to-Image Synthesis

Aashish Anantha Ramakrishnan; Dongwon Lee; Sharon X. Huang

arXiv: 2404.10141 · v2 · submitted 2024-04-15 · cs.CV · cs.CL · cs.MM

Pith reviewed 2026-05-24 01:40 UTC · model grok-4.3

classification cs.CV · cs.CL · cs.MM

keywords text-to-image synthesis · subject conditioning · LLM-driven enhancement · image-caption consistency · abstractive captions · ANCHOR dataset · SAFE method · multi-subject prompts

The pith

LLM-extracted key subjects enhanced at the embedding level improve text-to-image consistency on complex multi-subject captions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates ANCHOR, a dataset of more than 70,000 abstractive news captions, to expose where current text-to-image models fail on realistic prompts that contain multiple interacting subjects and contextual references. Analysis of these failures leads to Subject-Aware Fine-tuning (SAFE), a method that asks an LLM to identify the main subjects in a prompt and then strengthens their token embeddings before image synthesis. Experiments on existing models show gains in image-caption consistency and human preference scores. If the approach holds, it offers a lightweight way to make generation more reliable for detailed, real-world language without replacing the underlying model or encoder.

Core claim

Current image-text encoders struggle with multi-subject understanding, context reasoning, and nuanced grounding on abstractive captions. SAFE addresses this by using LLMs to extract key subjects and enhance their representation at the embedding level, producing measurable gains in consistency and preference alignment across contemporary text-to-image models.

What carries the argument

Subject-Aware Fine-tuning (SAFE), which extracts key subjects via LLM and boosts their embeddings for targeted conditioning.
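A minimal sketch of the embedding-level boost, assuming a whitespace tokenizer with one embedding row per token and a subject list that an LLM has already returned. The function name `boost_subject_embeddings`, the exact-match rule, and the default multiplier are illustrative assumptions; the default of 1.1 squared mirrors the "x2" scale multiplier quoted in the extracted appendix text further down, read as (1.1)^2. This is not the authors' implementation, and the fine-tuning part of SAFE is not shown.

```python
import numpy as np

def boost_subject_embeddings(tokens, embeddings, subjects, scale=1.1 ** 2):
    """Scale the embedding rows of tokens that belong to a key subject.

    tokens     : list of str, one per embedding row (hypothetical tokenizer)
    embeddings : array-like of shape (len(tokens), dim)
    subjects   : subject phrases, assumed to come from an LLM
    scale      : multiplier applied once to every matched row (assumed value)
    """
    boosted = np.array(embeddings, dtype=float, copy=True)
    lowered = [t.lower() for t in tokens]
    mask = np.zeros(len(tokens), dtype=bool)
    for phrase in subjects:
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        # Mark every position where the phrase occurs as a token run.
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == words:
                mask[start:start + n] = True
    # A mask ensures overlapping subjects are not scaled twice.
    boosted[mask] *= scale
    return boosted

tokens = "Donald Trump waves to the crowd".split()
embeddings = np.ones((len(tokens), 4))      # placeholder embeddings
subjects = ["Donald Trump", "crowd"]        # placeholder LLM output
boosted = boost_subject_embeddings(tokens, embeddings, subjects)
print(boosted[:, 0])
```

Scaling through a mask rather than in the matching loop keeps the boost idempotent per token, which matters when the LLM returns nested phrases such as a full name and a surname.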

If this is right

Text-to-image outputs align more closely with captions that contain several interacting subjects and contextual details.
Human raters prefer the generated images over baseline outputs on the same complex prompts.
The method works as a plug-in fine-tuning step on existing models without requiring changes to the base architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same subject-extraction step could be applied to other conditioning signals such as style or spatial layout.
Users might obtain better results on long, narrative prompts without manual rewriting.
If the gains hold across new models, the technique could become a standard preprocessing layer for production text-to-image systems.

Load-bearing premise

That the subjects identified by the LLM are the ones whose strengthened embeddings will raise overall consistency without harming other parts of the prompt or introducing new errors.
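One way to probe this premise is to score the LLM's subject list against a small hand-labeled set before any boosting is applied. The sketch below assumes exact string matching after lowercasing; the function name and the example subject lists are hypothetical and do not come from the paper.

```python
def subject_extraction_scores(predicted, reference):
    """Return (precision, recall, f1) for predicted vs. hand-labeled subjects."""
    pred = {s.strip().lower() for s in predicted if s.strip()}
    ref = {s.strip().lower() for s in reference if s.strip()}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Placeholder lists for illustration only.
predicted = ["Boris Johnson", "photographers", "London"]
reference = ["Boris Johnson", "Olympic rings", "photographers"]
precision, recall, f1 = subject_extraction_scores(predicted, reference)
print(precision, recall, f1)
```

Low precision would mean the boost is amplifying tokens that are not central to the caption, which is the failure this premise has to rule out.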

What would settle it

Running SAFE on a standard model and finding no improvement or a drop in consistency scores on the ANCHOR test set or on other multi-subject caption benchmarks would falsify the central claim.
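Such a check could be run as a paired comparison on per-caption consistency scores. The sketch below uses an exact one-sided sign-flip test, which is one reasonable choice and not something the paper specifies; the function name is hypothetical and the score lists are made-up placeholders, not results from the paper.

```python
from itertools import product

def paired_sign_flip_test(baseline, treated):
    """Exact one-sided sign-flip test on paired score differences.

    Returns (mean improvement, p-value). Enumerates all 2**n sign
    patterns, so it is only practical for small n.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    observed = sum(diffs) / n
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        total += 1
        flipped = sum(s * d for s, d in zip(signs, diffs)) / n
        # Small tolerance guards against floating-point ties.
        if flipped >= observed - 1e-12:
            at_least_as_large += 1
    return observed, at_least_as_large / total

# Placeholder per-caption scores, same captions in the same order.
baseline_scores = [0.60, 0.55, 0.70, 0.62, 0.58, 0.66]
safe_scores = [0.65, 0.61, 0.72, 0.70, 0.63, 0.69]
mean_gain, p_value = paired_sign_flip_test(baseline_scores, safe_scores)
print(mean_gain, p_value)
```

A mean gain near zero or below it, or a large p-value on the ANCHOR test set, would be the kind of outcome that counts against the central claim.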

Figures

Figures reproduced from arXiv: 2404.10141 by Aashish Anantha Ramakrishnan, Dongwon Lee, Sharon X. Huang.

**Figure 2.** Overview of our dataset’s pre-processing and filtering steps

**Figure 3.** Overview of our Subject-Aware FinE-tuning Approach (SAFE)

**Figure 4.** Qualitative comparison of different T2I models on ...

**Figure 5.** Distribution of Article Topics for samples in ANCHOR. (Adjacent text captured with the caption: "... captions Abstractive in nature? We launched our survey with 150 unique participants and each participant rated 10 samples. The survey layout is presented in ...")

**Figure 6.** Survey UI for Data Quality Evaluation Study

**Figure 7.** Survey UI for Generated Image Evaluation Study

**Figure 8.** Qualitative comparison of different T2I models on ...

**Figure 9.** Qualitative comparison of different T2I models on ...

Original abstract

Text-to-image (T2I) models have achieved remarkable progress in high-quality image synthesis, yet most benchmarks rely on simple, self-contained prompts, failing to capture the complexity of real-world captions. Human-written captions often involve multiple interacting subjects, rich contextual references, and abstractive phrasing, conditions under which current image-text encoders like CLIP struggle. To systematically study these deficiencies, we introduce ANCHOR, a large-scale dataset of 70K+ abstractive captions sourced from five major news media organizations. Analysis with ANCHOR reveals persistent failures in multi-subject understanding, context reasoning, and nuanced grounding. Motivated by these challenges, we propose Subject-Aware Fine-tuning (SAFE), which uses Large Language Models (LLMs) to extract key subjects and enhance their representation at the embedding-level. Experiments with contemporary models show that SAFE significantly improves image-caption consistency and human preference alignment, serving as a practical and scalable solution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ANCHOR gives a useful new dataset of real news captions that expose T2I weaknesses on complex prompts, and SAFE is a straightforward LLM-based fix, but the abstract supplies zero numbers or baselines to show it works.

The paper introduces ANCHOR, a 70K+ dataset of abstractive captions pulled from major news outlets, and SAFE, which routes those captions through an LLM to identify key subjects and then boosts their embeddings during fine-tuning of the text encoder. That dataset is the clearest new piece: standard T2I benchmarks use short, self-contained prompts, so news-style captions with multiple interacting subjects and context are a reasonable way to surface the actual failure modes in CLIP-style encoders. The SAFE approach is also a direct response to those failures rather than another round of generic prompt tuning. Both pieces address a practical gap that shows up in deployed systems.

The evaluation is the main gap. The abstract states that experiments on contemporary models show significant gains in image-caption consistency and human preference, yet it gives no metrics, no baseline comparisons, no ablations on the LLM extraction step, and no failure-case analysis. Without those details it is impossible to tell whether the subject extraction is accurate enough or whether the embedding boost trades off other prompt elements. The assumption that LLM-identified subjects are the load-bearing parts of the caption is plausible but untested in the provided text.

This work is aimed at researchers and engineers who build or fine-tune T2I systems for media and creative tools and who already know the CLIP limitations on long prompts. A reader in that group could take the dataset and the high-level method as a starting point even if the results need more evidence. I would send it to peer review because the problem framing and the dataset are concrete and the method is simple enough to implement and test; the current version would just need the experiments expanded before acceptance.

Referee Report

1 major / 0 minor

Summary. The paper introduces ANCHOR, a dataset of 70K+ abstractive captions from news media, to expose failures of current T2I models on multi-subject, contextual, and abstractive prompts. It proposes Subject-Aware Fine-tuning (SAFE), which employs LLMs to extract key subjects and perform embedding-level enhancement, and asserts that experiments on contemporary models demonstrate significant gains in image-caption consistency and human preference alignment.

Significance. If the claimed gains are reproducible and the method does not introduce new failure modes, the dataset would be a useful benchmark resource and SAFE could offer a scalable, LLM-assisted route to better subject grounding in T2I systems that currently rely on CLIP-style encoders.

major comments (1)

[Abstract] The central claim that 'experiments with contemporary models show that SAFE significantly improves image-caption consistency and human preference alignment' is unsupported by any reported metrics, baselines, ablation results, statistical tests, or implementation details, leaving the empirical contribution unevaluable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the opportunity to clarify the empirical support for our claims. We address the single major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central claim that 'experiments with contemporary models show that SAFE significantly improves image-caption consistency and human preference alignment' is unsupported by any reported metrics, baselines, ablation results, statistical tests, or implementation details, leaving the empirical contribution unevaluable.

Authors: We agree that the abstract, due to length constraints, summarizes the experimental outcomes without citing specific numbers or section references. The full manuscript reports quantitative results in Section 4 (including consistency metrics such as subject grounding accuracy and CLIP-based alignment scores), human preference studies in Section 5, baseline comparisons against standard T2I fine-tuning and other conditioning methods, ablation studies on the LLM extraction and embedding enhancement components, and implementation details in the appendix. To make this support explicit, we will revise the abstract to include one or two representative metrics and add cross-references to the relevant sections and tables.

Revision: yes
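For readers unfamiliar with CLIP-based alignment scores, the sketch below shows the CLIPScore formula from reference [4], w · max(cos(image, text), 0) with w = 2.5, applied to precomputed embedding vectors. Whether the manuscript uses exactly this variant is an assumption; the function name and the toy vectors are illustrative, and real use would take embeddings from a CLIP image and text encoder.

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style alignment: w * max(cosine similarity, 0)."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    denom = np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    if denom == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    cosine = float(image_emb @ text_emb / denom)
    # Negative similarities are clipped to zero, as in CLIPScore.
    return w * max(cosine, 0.0)

# Toy 2-d vectors standing in for encoder outputs.
aligned = clip_score([1.0, 0.0], [1.0, 0.0])
opposed = clip_score([1.0, 0.0], [-1.0, 0.0])
partial = clip_score([1.0, 0.0], [1.0, 1.0])
print(aligned, opposed, partial)
```

Because the score depends on the same CLIP-style encoder whose limits the paper is probing, a reader may want to weigh it alongside the human preference results rather than on its own.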

Circularity Check

0 steps flagged

No derivation chain or self-referential fitting present

The paper introduces an empirical dataset (ANCHOR) and a fine-tuning method (SAFE) that relies on external LLM subject extraction followed by embedding enhancement. No equations, derivations, predictions, or first-principles results are claimed or present in the provided text. The central claims rest on experimental outcomes rather than any reduction to inputs by construction, self-citation load-bearing, or ansatz smuggling. This is a standard empirical contribution with no circularity in a derivation sense.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach builds on standard LLM and T2I components without introducing new postulated objects.

pith-pipeline@v0.9.0 · 5695 in / 910 out tokens · 20969 ms · 2026-05-24T01:40:06.244693+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 5 internal anchors

[1]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, April

[2]

VQGAN-CLIP: Open domain image generation and editing with natural language guidance

Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open domain image generation and editing with natural language guidance. arXiv:2204.08583 [cs], April

[3]

ArcFace: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2019.

[4]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514–7528, Online and Punta Cana, Dominican Republic, November 2021.

[5]

LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. ArXiv, abs/2305.13655, May

[6]

Visual news: Benchmark and challenges in news image captioning

Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Ordonez. Visual news: Benchmark and challenges in news image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6761–6771, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.

[7]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with Text-Guided diffusion models. arXiv:2112.10741 [cs], March

[8]

Pragmatic Issue-Sensitive image captioning

Allen Nie, Reuben Cohn-Gordon, and Christopher Potts. Pragmatic Issue-Sensitive image captioning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1924–1938, Online, November 2020. Association for Computational Linguistics.

[9]

Toward verifiable and reproducible human evaluation for text-to-image generation

Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, and Shin’ichi Satoh. Toward verifiable and reproducible human evaluation for text-to-image generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14277–14286. IEEE, June 2023.

[10]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv:2103.00020 [cs], February

[11]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional image generation with CLIP latents. arXiv:2204.06125 [cs], April

[12]

sentence-transformers/all-MiniLM-L6-v2 · hugging face

Nils Reimers. sentence-transformers/all-MiniLM-L6-v2 · hugging face. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2. Accessed: 2024-4-5.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Patter...

[13]

DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2023.

[14]

FaceNet: A unified embedding for face recognition and clustering

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823. IEEE, June 2015.

[15]

Transform and tell: Entity-aware news image captioning

Alasdair Tran, Alexander Mathews, and Lexing Xie. Transform and tell: Entity-aware news image captioning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13032–13042, Seattle, WA, USA, 2020.

[16]

Context-Aware captions from Context-Agnostic supervision

Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. Context-Aware captions from Context-Agnostic supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1070–1079. IEEE, July 2017.

[17]

AttnGAN: Fine-Grained text to image generation with attentional generative adversarial networks

Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained text to image generation with attentional generative adversarial networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1316–1324, Salt Lake City, UT, USA, 2018.

[18]

Understanding deep learning requires rethinking generalization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv:1611.03530 [cs], February

[19]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586–595, Salt Lake City, UT, 2018.

[20]

DM-GAN: Dynamic memory generative adversarial networks for Text-To-Image synthesis

Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic memory generative adversarial networks for Text-To-Image synthesis. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5795–5803, Long Beach, CA, USA, 2019. IEEE.

[21]

A Dataset Insights, A.1 Caption Statistics: In this section, we provide additional statistics on the ANCHOR dataset and analyze the distribution of image-caption pairs

A Dataset Insights. A.1 Caption Statistics. In this section, we provide additional statistics on the ANCHOR dataset and analyze the distribution of image-caption pairs. In Table 4, we provide caption statistics of ANCHOR compared to 2 popular image-caption pair datasets: COCO Captions Chen et al. (2015) and Conceptual Captions 3M (CC3M) Sharma et a...

[22]

We selected a scale multiplier of x2 as it scores the highest in 2 out of 3 metrics tested

Here, x1 refers to a scale factor of 1.1, x2 refers to a scale factor of (1.1)^2, and so forth. We selected a scale multiplier of x2 as it scores the highest in 2 out of 3 metrics tested. Increasing the scale multiplier beyond x2 does not provide any meaningful improvement in generation performance. Model FIDCLIP (↓) ImageReward (↑) HPS V2 (↑) SAFE (DFE + ...

[23]

We report the average metric scores across all entity classes in Table 7 to visualize the overall quality of generated images

face recognition model is used for calculating Identity Preservation scores. We report the average metric scores across all entity classes in Table 7 to visualize the overall quality of generated images. We also present Figure 8 with examples Ex1, Ex2, Ex3, Ex4 generated using the ANCHOR Entity Test Set. We observe that qualitatively, the generated images...

[24]

Donald Trump waves to the crowd during a campaign rally on June 18 2016 in Phoenix

London’s mayor Boris Johnson gives a big thumbs up to photographers during the unveiling of the 2012 Olympic rings on Tower Bridge. Donald Trump waves to the crowd during a campaign rally on June 18 2016 in Phoenix. Hillary Clinton greets audience members following a campaign organizing event at Eagle Heights elementary in Clinton Iowa. Figure 8: Qualitat...
