arxiv: 2605.19111 · v1 · pith:FAA2UQC3new · submitted 2026-05-18 · 💻 cs.CV · cs.AI

FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

Pith reviewed 2026-05-20 10:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: text-to-image evaluation, factual correctness, LLM agents, VLM verification, image refinement, factuality metrics, A/B testing, agentic framework

The pith

FAGER evaluates text-to-image models on factual correctness using agentic LLM and VLM methods and outperforms prior metrics on implied facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing metrics for text-to-image generation mainly check details stated explicitly in the prompt and miss the implicit factual requirements of scientific, historical, or cultural prompts. FAGER addresses this by building a factual rubric through LLM fact proposal, reference-guided visual fact extraction, and verification, then having a VLM answer QA pairs derived from the rubric. This supports both evaluation and training-free refinement of generated images. A sympathetic reader would care because better factuality in generated images supports reliable use in education, media, and knowledge sharing. The paper reports that FAGER consistently outperforms prior metrics on five datasets spanning science, history, products, culture, and knowledge-intensive concepts.

Core claim

FAGER constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. Validated via a Factual A/B test that prefers factual reference images over generated ones, it consistently outperforms prior metrics across five datasets and enables fully training-free refinement of T2I outputs with substantial factuality gains.
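The rubric-to-QA scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `RubricItem` schema, the `fager_style_score` function, the plain fraction-of-correct-answers scoring rule, and the canned `answer_fn` that stands in for a VLM call are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RubricItem:
    """One verified fact phrased as a question with its expected answer (hypothetical schema)."""
    fact: str
    question: str
    expected: str


def fager_style_score(rubric: List[RubricItem], answer_fn: Callable[[str], str]) -> float:
    """Fraction of rubric questions answered as expected.

    `answer_fn` stands in for a VLM queried about one image. Plain accuracy
    over QA pairs is an assumed scoring rule, not the paper's formula.
    """
    if not rubric:
        raise ValueError("rubric must contain at least one item")
    correct = sum(
        answer_fn(item.question).strip().lower() == item.expected.strip().lower()
        for item in rubric
    )
    return correct / len(rubric)


# Toy rubric for the prompt "A molecule of Ethanol" (standard chemistry facts).
rubric = [
    RubricItem("Ethanol has two carbon atoms", "How many carbon atoms are shown?", "2"),
    RubricItem("Ethanol has one oxygen atom", "How many oxygen atoms are shown?", "1"),
    RubricItem("Ethanol has a hydroxyl group", "Is an O-H group visible?", "yes"),
]

# Canned answers for a hypothetical image that draws three carbons.
canned = {
    "How many carbon atoms are shown?": "3",
    "How many oxygen atoms are shown?": "1",
    "Is an O-H group visible?": "Yes",
}
score = fager_style_score(rubric, canned.__getitem__)
print(f"{score:.2f}")  # two of three questions match
```

Keeping each rubric fact next to its question means a wrong answer can be traced back to a specific fact, which is what makes the feedback actionable rather than a single opaque number.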

What carries the argument

The FAGER agentic framework grounds evaluation in visually verifiable facts that are proposed by an LLM, extracted from reference images, and then checked through VLM QA pairs. It carries the argument by shifting the focus from prompt-text alignment to external factual correctness.

If this is right

  • FAGER outperforms prior metrics on the Factual A/B test across science, history, products, culture, and knowledge-intensive datasets.
  • T2I outputs can be refined using FAGER feedback in a fully training-free manner.
  • Refinement yields substantial factuality gains on the tested domains.
  • The framework supplies actionable feedback that goes beyond binary scoring.
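A training-free refinement loop of the kind listed above might look like the sketch below. Everything here is hypothetical: images are modeled as attribute dictionaries, `failed_facts` plays the role of the evaluator, and `toy_editor` stands in for an image-editing model. The paper's actual editing interface and stopping rule are not given in this summary.

```python
from typing import Callable, Dict, List, Tuple

Image = Dict[str, str]      # toy stand-in: attribute name -> depicted value
Failure = Tuple[str, str]   # (attribute, value the rubric requires)


def failed_facts(image: Image, rubric: Dict[str, str]) -> List[Failure]:
    """List the rubric facts that the image does not satisfy."""
    return [(attr, want) for attr, want in rubric.items() if image.get(attr) != want]


def refine(
    image: Image,
    rubric: Dict[str, str],
    edit_fn: Callable[[Image, List[Failure]], Image],
    max_rounds: int = 3,
) -> Tuple[Image, List[str]]:
    """Evaluate, turn failures into edit instructions, edit, and repeat.

    No model weights are updated; only the image is changed between rounds.
    """
    log: List[str] = []
    for _ in range(max_rounds):
        failures = failed_facts(image, rubric)
        if not failures:
            break  # every rubric fact is satisfied
        log.append("; ".join(f"change {attr} to {want}" for attr, want in failures))
        image = edit_fn(image, failures)
    return image, log


def toy_editor(image: Image, failures: List[Failure]) -> Image:
    """Hypothetical editor that applies every requested change on a copy."""
    updated = dict(image)
    updated.update(failures)
    return updated


# Copper flame test: the flame should be blue-green, not orange.
rubric = {"flame_color": "blue-green", "sample": "copper salt"}
generated = {"flame_color": "orange", "sample": "copper salt"}
refined, log = refine(generated, rubric, toy_editor)
print(log)
print(refined)
```

A real editor will not fix every failure in one pass, which is why the loop re-evaluates after each edit and caps the number of rounds.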

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rubric-and-QA approach could transfer to factuality checks in text-to-video or text-to-3D generation where temporal or spatial consistency matters.
  • Embedding FAGER into generation pipelines might allow iterative self-correction before the final image is produced.
  • Reducing dependence on external reference images would expand use to open-ended creative prompts.

Load-bearing premise

LLM-proposed facts combined with reference-guided visual extraction and VLM-based QA pairs will reliably capture visually verifiable factual correctness without systematic bias or hallucination in the fact proposal or verification steps.

What would settle it

A dataset in which human judges find that FAGER prefers less factual generated images over reference images in the A/B test, or where refinement steps based on FAGER feedback produce no measurable increase in correctly depicted facts.

Figures

Figures reproduced from arXiv: 2605.19111 by Cusuh Ham, Deepti Ghadiyaram, Pin-Yu Chen, Youngsun Lim.

Figure 1: Overview of real and nonfactual images and their evaluations by different T2I metrics. Existing metrics such as VQAScore [18] and FineGRAIN [10] fail to properly assess facts implied by the prompt. In the ethanol example, VQAScore assigns higher scores to nonfactual images, while FineGRAIN focuses on a prompt-explicit question that does not distinguish factual correctness. In contrast, FAGER evaluates pro…

Figure 2: Overview of the proposed FAGER pipeline on the prompt “A molecule of Ethanol.” The fact proposal agent generates structured candidate facts from the prompt using prior knowledge. In parallel, the reference-guided fact extraction agent extracts visually observable facts from a reference image. The verification agent then consolidates the two sources into a verified factual rubric, adding missing factual req…

Figure 3: Illustration of the Factual A/B test with FAGER score. For each prompt, we compare the FAGER scores of a real image and a generated image. The left example shows a correctly ranked case where the real image scores higher. The middle example shows a tie, which we count as correct because the generated image may also satisfy the factual requirements of the prompt. The right example shows a failure case where…

Figure 4: Qualitative examples of FAGER-guided image refinement. The first three rows show edit cases, where FAGER identifies localized factual errors and produces targeted edit instructions that improve factual correctness while preserving the overall image content. These examples include correcting the flame color in a copper flame test, revising product-specific packaging details, and fixing the visual structure …

Figure 5: Qualitative examples where FAGER-guided refinement…
Abstract

Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.
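The Factual A/B test reduces to a simple pairwise accuracy, sketched below. Counting ties as correct follows the Figure 3 caption; the function name `factual_ab_accuracy` and the example scores are made up for illustration.

```python
from typing import Iterable, Tuple


def factual_ab_accuracy(pairs: Iterable[Tuple[float, float]]) -> float:
    """Share of prompts where the metric ranks the real image at least as high.

    Each pair is (score_of_real_image, score_of_generated_image).
    A tie counts as correct, as described in the Figure 3 caption.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (real, generated) score pair")
    correct = sum(real >= generated for real, generated in pairs)
    return correct / len(pairs)


# Illustrative scores only: a correct ranking, a tie, a failure, a correct ranking.
scores = [(1.0, 0.6), (0.8, 0.8), (0.5, 0.9), (0.9, 0.4)]
print(factual_ab_accuracy(scores))  # 0.75
```

Because ties count as correct, a metric that returns the same score for every image would reach 100% on this test, so the tie rate is worth reporting alongside the accuracy.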

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes FAGER, an agentic framework for factually grounded evaluation and refinement of text-to-image (T2I) models. It constructs a factual rubric using LLM-based fact proposal combined with reference-guided visual fact extraction and verification, converts it to VLM-based QA pairs for scoring, and validates this via a Factual A/B test showing consistent outperformance over prior metrics across five datasets in science, history, products, culture, and knowledge-intensive concepts. Additionally, it demonstrates training-free refinement of T2I outputs using the framework.

Significance. If the results are robust, this work could provide a valuable tool for evaluating and improving factual accuracy in T2I generations for prompts requiring external knowledge, where current metrics fall short. The training-free refinement is a notable strength, offering practical utility without additional training. The introduction of the Factual A/B test as a validation method is an interesting contribution to metric evaluation.

major comments (1)
  1. [Factual A/B test and rubric construction (abstract and experiments)] The Factual A/B test (described in the abstract and presumably detailed in the experiments section) measures whether the metric assigns higher scores to factual reference images than to T2I outputs. However, the rubric itself is constructed via LLM fact proposal plus reference-guided visual extraction and verification; this means reference images are aligned with the rubric by design. The test therefore risks confirming consistency with the reference-derived facts rather than independently validating detection of factual errors in generated images. The manuscript does not describe external human validation, knowledge-base cross-checks, or controls for LLM/VLM biases in fact proposal, which is load-bearing for the claim that FAGER captures 'visually verifiable factual correctness'.
minor comments (1)
  1. [Abstract] The abstract states 'consistent outperformance' without quantitative details or statistical significance; adding a brief summary of effect sizes or p-values would improve clarity.
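One way to attach a p-value to "metric A beats metric B on the A/B test" is an exact two-sided sign test on the prompts where the two metrics disagree. The sketch below is a suggestion, not something the manuscript reports; `exact_sign_test`, `discordant_counts`, and the example outcomes are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> Tuple[int, int]:
    """Count prompts where only metric A, or only metric B, ranks the pair correctly."""
    if len(a_correct) != len(b_correct):
        raise ValueError("both metrics must be evaluated on the same prompts")
    only_a = sum(a and not b for a, b in zip(a_correct, b_correct))
    only_b = sum(b and not a for a, b in zip(a_correct, b_correct))
    return only_a, only_b


def exact_sign_test(only_a: int, only_b: int) -> float:
    """Two-sided exact p-value under the null that disagreements are 50/50."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the metrics never disagree, so there is no evidence either way
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative outcomes on 12 prompts: A alone is right on 8, B alone on 2, both on 2.
a = [True] * 8 + [False] * 2 + [True] * 2
b = [False] * 8 + [True] * 2 + [True] * 2
only_a, only_b = discordant_counts(a, b)
print(only_a, only_b, exact_sign_test(only_a, only_b))  # 8 2 0.109375
```

Prompts where both metrics agree carry no information about which one is better, which is why only the discordant pairs enter the test.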

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential utility of FAGER for factuality evaluation in text-to-image generation. We address the major comment on the Factual A/B test and rubric construction below.

point-by-point responses
  1. Referee: [Factual A/B test and rubric construction (abstract and experiments)] The Factual A/B test (described in the abstract and presumably detailed in the experiments section) measures whether the metric assigns higher scores to factual reference images than to T2I outputs. However, the rubric itself is constructed via LLM fact proposal plus reference-guided visual extraction and verification; this means reference images are aligned with the rubric by design. The test therefore risks confirming consistency with the reference-derived facts rather than independently validating detection of factual errors in generated images. The manuscript does not describe external human validation, knowledge-base cross-checks, or controls for LLM/VLM biases in fact proposal, which is load-bearing for the claim that FAGER captures 'visually verifiable factual correctness'.

    Authors: We appreciate the referee's identification of this potential circularity concern. The LLM-based fact proposal generates candidate facts solely from the input prompt, independent of any visual input. The subsequent reference-guided visual fact extraction and verification step then selects and confirms only those facts that can be visually verified in the reference images, producing a rubric of prompt-implied, visually grounded facts. The Factual A/B test evaluates whether the downstream VLM-based QA scoring assigns higher factuality to these reference images (which contain the verified facts by construction) than to T2I outputs that may deviate from them. This design provides a controlled proxy for the metric's sensitivity to factual inaccuracies, as the references serve as the canonical visual realization of the rubric facts. That said, we acknowledge that the manuscript does not include separate external human validation, knowledge-base cross-checks, or explicit bias controls for the LLM/VLM components, and that these would provide stronger independent corroboration. We will revise the manuscript to explicitly discuss this limitation, clarify the independence of the fact-proposal stage, and outline how future work could incorporate human studies or external KB verification to further substantiate the claims.

    revision: partial

Circularity Check

1 step flagged

FAGER A/B validation partially relies on reference-guided construction for its preference result

specific steps
  1. fitted input called prediction [Abstract]
    "FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images."

    The rubric facts are extracted directly from the reference image. Consequently the VLM QA evaluation on that same reference image matches the rubric by construction, guaranteeing the metric prefers the reference in the A/B comparison. This preference result is therefore statistically forced by the metric's own reference-dependent construction rather than serving as an external test of factual correctness.

full rationale

The paper's core validation introduces a Factual A/B test that checks whether the metric assigns higher scores to factual reference images than to T2I outputs. However, FAGER constructs its factual rubric explicitly via reference-guided visual fact extraction and verification before applying VLM QA. This makes higher scores on the reference arm tautological for FAGER (the rubric is derived from the same image), while prior metrics lack this reference access. The outperformance claim therefore receives some forced support from the test design itself rather than fully independent evidence of factual detection. No mathematical self-definition, self-citation chains, or ansatz smuggling appear; the framework retains independent empirical content on refinement and multi-domain datasets.
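The circularity concern can be made concrete with a deliberately extreme toy case in which every rubric fact is read off the reference image. This is a simplification for illustration only: the actual pipeline also uses LLM-proposed facts and a verification agent, which may catch some of these cases. The attribute dictionaries and function names are hypothetical.

```python
from typing import Dict

Image = Dict[str, int]  # toy stand-in: attribute name -> depicted count


def build_rubric_from_reference(reference: Image) -> Image:
    """Extreme simplification: the rubric is exactly what the reference shows."""
    return dict(reference)


def score(image: Image, rubric: Image) -> float:
    """Fraction of rubric facts that the image satisfies."""
    return sum(image.get(attr) == want for attr, want in rubric.items()) / len(rubric)


# Ground truth for ethanol, a hypothetical flawed reference, and an accurate generation.
truth = {"carbons": 2, "oxygens": 1}
flawed_reference = {"carbons": 3, "oxygens": 1}
accurate_generation = {"carbons": 2, "oxygens": 1}

rubric = build_rubric_from_reference(flawed_reference)
ref_score = score(flawed_reference, rubric)
gen_score = score(accurate_generation, rubric)

# The reference wins under its own rubric even though it is the less factual image.
print(ref_score, gen_score)                 # 1.0 0.5
print(score(accurate_generation, truth))    # 1.0 against ground truth
```

In this setup the reference scores 1.0 no matter what it depicts, so a high A/B accuracy would say nothing about factual correctness. An independent check, such as human judgments or a knowledge base, is what breaks the loop.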

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the unstated assumption that current LLMs and VLMs can be trusted to propose and verify facts accurately enough for the rubric to be reliable. No free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption: LLM-based fact proposal combined with reference-guided visual fact extraction produces a reliable structured factual rubric.
    This is invoked in the description of how FAGER constructs the rubric before VLM evaluation.
  • domain assumption: VLM answers to the generated QA pairs accurately reflect visual factual correctness.
    Central to the evaluation step described in the abstract.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 2 internal anchors

  [1] Alibaba Group. Qwen-image-edit-2511. https://huggingface.co/Qwen/Qwen-Image-Edit-2511, 2024. Model card.

  [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

  [3] Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3. arXiv preprint arXiv:2408.07009, 2024.

  [4] Moshe Bar, Karim S Kassam, Avniel Singh Ghuman, Jasmine Boshyan, Annette M Schmid, Anders M Dale, Matti S Hämäläinen, Ksenija Marinkovic, Daniel L Schacter, Bruce R Rosen, et al. Top-down facilitation of visual recognition. Proceedings of the national academy of sciences, 103(2):449–454, 2006.

  [5] Black Forest Labs. Flux.1 [dev]. https://huggingface.co/black-forest-labs/FLUX.1-dev, 2024.

  [6] Black Forest Labs. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space,

  [7] Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3043–3054, 2023.

  [8] Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126–21136, 2022.

  [9] Google. Gemini 3 pro image preview. https://ai.google.dev/gemini-api/docs/models/gemini-3-pro-image-preview, 2026. Google AI for Developers documentation. Accessed: 2026-04-07.

  [10] Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, and Tom Goldstein. Finegrain: Evaluating failure modes of text-to-image models with vision language model judges. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  [11] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021.

  [12] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023.

  [13] Ziwei Huang, Wanggui He, Quanyu Long, Yandi Wang, Haoyuan Li, Zhelun Yu, Fangxun Shu, Long Chan, Hao Jiang, Fei Wu, et al. T2i-factualbench: Benchmarking the factuality of text-to-image models with knowledge-intensive concepts. arXiv preprint arXiv:2412.04300, 2024.

  [14] Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, and Marjan Ghazvininejad. Geneval 2: Addressing benchmark drift in text-to-image evaluation. arXiv preprint arXiv:2512.16853, 2025.

  [15] Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025.

  [16] Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024.

  [17] Youngsun Lim, Hojun Choi, and Hyunjung Shim. Evaluating image hallucination in text-to-image generation with question-answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 26290–26298, 2025.

  [18] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, pages 366–384. Springer, 2024.

  [19] OpenAI. Dall-e 3 system card. https://openai.com/index/dall-e-3-system-card/, 2023. Accessed: August 15, 2024.

  [20] OpenAI. Update to GPT-5 system card: GPT-5.2. Technical report, OpenAI, 2025.

  [21] OpenAI. Introducing gpt-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, 2026.

  [22] Philippe G Schyns and Aude Oliva. From blobs to boundary edges: Evidence for time-and spatial-scale-dependent scene recognition. Psychological science, 5(4):195–200, 1994.

  [23] Stability AI. Stable diffusion 3.5 large. https://huggingface.co/stabilityai/stable-diffusion-3.5-large, 2024. Model card.

  [24] Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, and Idan Szpektor. What you see is what you read? improving text-image alignment evaluation. Advances in Neural Information Processing Systems, 36:1601–1619, 2023.

  [25] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.