FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models
Pith reviewed 2026-05-20 10:35 UTC · model grok-4.3
The pith
FAGER evaluates text-to-image models on factual correctness using agentic LLM and VLM methods and outperforms prior metrics on implied facts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FAGER constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. It is validated with a Factual A/B test, which measures whether a metric prefers factual reference images over the corresponding generated images. On this test FAGER consistently outperforms prior metrics across five datasets, and it enables fully training-free refinement of T2I outputs with substantial factuality gains.
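The flow from rubric to score can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `Fact`, `build_rubric`, `to_question`, and `score_image` are hypothetical, the LLM and VLM are replaced by toy stand-ins, and scoring as the fraction of "yes" answers is an assumption, since the page does not state how FAGER aggregates answers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Fact:
    statement: str  # a visually checkable claim implied by the prompt
    level: int      # 1 = object identity, 2 = key component, 3 = fine detail


def build_rubric(candidates: List[Fact],
                 is_visible_in_reference: Callable[[Fact], bool]) -> List[Fact]:
    """Keep only the proposed facts that the reference image confirms."""
    return [f for f in candidates if is_visible_in_reference(f)]


def to_question(fact: Fact) -> str:
    """Turn a rubric fact into a yes/no question for a VLM."""
    return f"Does the image show that {fact.statement}? Answer yes or no."


def score_image(rubric: List[Fact], vlm_answer: Callable[[str], str]) -> float:
    """Assumed aggregation: fraction of rubric questions answered 'yes'."""
    if not rubric:
        return 0.0
    passed = sum(
        vlm_answer(to_question(f)).strip().lower().startswith("yes")
        for f in rubric
    )
    return passed / len(rubric)


# Toy stand-ins for the LLM proposal and the reference-guided check.
candidates = [
    Fact("the object is a bicycle", 1),
    Fact("the bicycle has two wheels", 2),
    Fact("the bicycle bell rings loudly", 3),  # not visually verifiable
]
rubric = build_rubric(candidates, lambda f: "rings" not in f.statement)


def toy_vlm(question: str) -> str:
    """Pretend VLM judging a generated image with the wrong wheel count."""
    return "yes" if "is a bicycle" in question else "no"


score = score_image(rubric, toy_vlm)
print(len(rubric), score)
```

In this toy run the unverifiable fact is dropped during rubric construction, and the generated image passes one of the two remaining questions.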
What carries the argument
The FAGER agentic framework grounds evaluation in visually verifiable facts that are proposed by an LLM, extracted with the help of reference images, and then verified through VLM question-answer pairs. It carries the argument by shifting the focus from prompt-text alignment to external factual correctness.
If this is right
- FAGER outperforms prior metrics on the Factual A/B test across science, history, products, culture, and knowledge-intensive datasets.
- T2I outputs can be refined using FAGER feedback in a fully training-free manner.
- Refinement yields substantial factuality gains on the tested domains.
- The framework supplies actionable feedback that goes beyond binary scoring.
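The Factual A/B test behind the first bullet reduces to a preference rate over (reference, generated) pairs. The sketch below assumes that a metric is any function mapping an image to a number and that ties count against the metric; neither detail is confirmed by the page, and `ab_preference_rate` is a hypothetical name.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Image = TypeVar("Image")


def ab_preference_rate(pairs: Iterable[Tuple[Image, Image]],
                       metric: Callable[[Image], float]) -> float:
    """Fraction of pairs where the metric scores the reference strictly higher."""
    wins = 0
    total = 0
    for reference, generated in pairs:
        total += 1
        if metric(reference) > metric(generated):
            wins += 1  # ties are treated as failures (an assumption)
    if total == 0:
        raise ValueError("need at least one (reference, generated) pair")
    return wins / total


# Toy data: each "image" is already its metric score, so the metric is identity.
toy_pairs = [(0.9, 0.4), (0.8, 0.8), (0.7, 0.2), (0.5, 0.6)]
rate = ab_preference_rate(toy_pairs, metric=lambda s: s)
print(rate)
```

A rate near 0.5 would mean the metric cannot tell references from generations; the toy pairs above are made up purely to exercise the function.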
Where Pith is reading between the lines
- The rubric-and-QA approach could transfer to factuality checks in text-to-video or text-to-3D generation where temporal or spatial consistency matters.
- Embedding FAGER into generation pipelines might allow iterative self-correction before the final image is produced.
- Reducing dependence on external reference images would expand use to open-ended creative prompts.
Load-bearing premise
LLM-proposed facts combined with reference-guided visual extraction and VLM-based QA pairs will reliably capture visually verifiable factual correctness without systematic bias or hallucination in the fact proposal or verification steps.
What would settle it
A dataset in which human judges find that FAGER prefers less factual generated images over reference images in the A/B test, or where refinement steps based on FAGER feedback produce no measurable increase in correctly depicted facts.
Original abstract
Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.
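The training-free refinement described in the abstract can be pictured as an evaluate-then-edit loop. The sketch below is one plausible shape under stated assumptions, not the authors' procedure: `refine`, `generate`, `evaluate`, and `edit` are hypothetical names, the stopping rule is invented for illustration, and images are modeled as sets of depicted facts so the example runs without any model.

```python
from typing import Callable, FrozenSet, List, Tuple

Image = FrozenSet[str]  # toy stand-in: the set of facts an image depicts


def refine(prompt: str,
           generate: Callable[[str], Image],
           evaluate: Callable[[Image], Tuple[float, List[str]]],
           edit: Callable[[Image, List[str]], Image],
           max_rounds: int = 3,
           target: float = 1.0) -> Tuple[Image, List[float]]:
    """Edit the image using failed rubric items until the score stops improving."""
    image = generate(prompt)
    score, failed = evaluate(image)
    history = [score]
    for _ in range(max_rounds):
        if score >= target or not failed:
            break
        candidate = edit(image, failed)
        cand_score, cand_failed = evaluate(candidate)
        if cand_score <= score:
            break  # keep the best image seen so far
        image, score, failed = candidate, cand_score, cand_failed
        history.append(score)
    return image, history


RUBRIC = frozenset({"a", "b", "c"})  # toy rubric of three facts


def toy_evaluate(image: Image) -> Tuple[float, List[str]]:
    return len(image & RUBRIC) / len(RUBRIC), sorted(RUBRIC - image)


final_image, history = refine(
    "toy prompt",
    generate=lambda p: frozenset({"a"}),
    evaluate=toy_evaluate,
    edit=lambda img, failed: img | {failed[0]},  # fix one failed fact per round
)
print(sorted(final_image), history)
```

No model weights are updated anywhere in the loop, which is the sense in which such a procedure is training-free; only the image is changed in response to feedback.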
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FAGER, an agentic framework for factually grounded evaluation and refinement of text-to-image (T2I) models. It constructs a factual rubric using LLM-based fact proposal combined with reference-guided visual fact extraction and verification, converts it to VLM-based QA pairs for scoring, and validates this via a Factual A/B test showing consistent outperformance over prior metrics across five datasets in science, history, products, culture, and knowledge-intensive concepts. Additionally, it demonstrates training-free refinement of T2I outputs using the framework.
Significance. If the results are robust, this work could provide a valuable tool for evaluating and improving factual accuracy in T2I generations for prompts requiring external knowledge, where current metrics fall short. The training-free refinement is a notable strength, offering practical utility without additional training. The introduction of the Factual A/B test as a validation method is an interesting contribution to metric evaluation.
major comments (1)
- [Factual A/B test and rubric construction (abstract and experiments)] The Factual A/B test (described in the abstract and presumably detailed in the experiments section) measures whether the metric assigns higher scores to factual reference images than to T2I outputs. However, the rubric itself is constructed via LLM fact proposal plus reference-guided visual extraction and verification; this means reference images are aligned with the rubric by design. The test therefore risks confirming consistency with the reference-derived facts rather than independently validating detection of factual errors in generated images. The manuscript does not describe external human validation, knowledge-base cross-checks, or controls for LLM/VLM biases in fact proposal, which is load-bearing for the claim that FAGER captures 'visually verifiable factual correctness'.
minor comments (1)
- [Abstract] The abstract states 'consistent outperformance' without quantitative details or statistical significance; adding a brief summary of effect sizes or p-values would improve clarity.
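One simple way to attach a p-value to an A/B win rate is an exact one-sided binomial (sign) test against chance. The sketch below uses only the standard library; the counts are invented for illustration and are not results from the paper, and the choice of a sign test is an assumption about what the referee has in mind.

```python
from math import comb


def binomial_p_value(wins: int, n: int, p0: float = 0.5) -> float:
    """P(X >= wins) for X ~ Binomial(n, p0): one-sided exact test."""
    if not 0 <= wins <= n:
        raise ValueError("wins must be between 0 and n")
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(wins, n + 1)
    )


# Hypothetical counts: the metric prefers the reference in 8 of 10 pairs.
wins, n = 8, 10
effect_size = wins / n - 0.5  # win rate above chance level
p_value = binomial_p_value(wins, n)
print(effect_size, p_value)
```

Reporting the win rate above chance alongside the p-value covers both the effect size and the significance that the comment asks for.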
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential utility of FAGER for factuality evaluation in text-to-image generation. We address the major comment on the Factual A/B test and rubric construction below.
Point-by-point responses
Referee: [Factual A/B test and rubric construction (abstract and experiments)] The Factual A/B test (described in the abstract and presumably detailed in the experiments section) measures whether the metric assigns higher scores to factual reference images than to T2I outputs. However, the rubric itself is constructed via LLM fact proposal plus reference-guided visual extraction and verification; this means reference images are aligned with the rubric by design. The test therefore risks confirming consistency with the reference-derived facts rather than independently validating detection of factual errors in generated images. The manuscript does not describe external human validation, knowledge-base cross-checks, or controls for LLM/VLM biases in fact proposal, which is load-bearing for the claim that FAGER captures 'visually verifiable factual correctness'.
Authors: We appreciate the referee's identification of this potential circularity concern. The LLM-based fact proposal generates candidate facts solely from the input prompt, independent of any visual input. The subsequent reference-guided visual fact extraction and verification step then selects and confirms only those facts that can be visually verified in the reference images, producing a rubric of prompt-implied, visually grounded facts. The Factual A/B test evaluates whether the downstream VLM-based QA scoring assigns higher factuality to these reference images (which contain the verified facts by construction) than to T2I outputs that may deviate from them. This design provides a controlled proxy for the metric's sensitivity to factual inaccuracies, as the references serve as the canonical visual realization of the rubric facts. That said, we acknowledge that the manuscript does not include separate external human validation, knowledge-base cross-checks, or explicit bias controls for the LLM/VLM components, and that these would provide stronger independent corroboration. We will revise the manuscript to explicitly discuss this limitation, clarify the independence of the fact-proposal stage, and outline how future work could incorporate human studies or external KB verification to further substantiate the claims.
revision: partial
Circularity Check
FAGER A/B validation partially relies on reference-guided construction for its preference result
specific steps
- fitted input called prediction [Abstract]
"FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images."
The rubric facts are extracted and verified against the reference image. Consequently the VLM QA evaluation on that same reference image matches the rubric by construction, which biases the A/B comparison toward the reference. The preference result is therefore partly forced by the metric's own reference-dependent construction rather than serving as a fully external test of factual correctness.
full rationale
The paper's core validation introduces a Factual A/B test that checks whether the metric assigns higher scores to factual reference images than to T2I outputs. However, FAGER constructs its factual rubric explicitly via reference-guided visual fact extraction and verification before applying VLM QA. This makes higher scores on the reference arm tautological for FAGER (the rubric is derived from the same image), while prior metrics lack this reference access. The outperformance claim therefore receives some forced support from the test design itself rather than fully independent evidence of factual detection. No mathematical self-definition, self-citation chains, or ansatz smuggling appear; the framework retains independent empirical content on refinement and multi-domain datasets.
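A toy model makes the concern concrete. In the sketch below, images are sets of attributes and a rubric is a set of facts; the names, the attribute labels, and the overlap-based score are all assumptions made for illustration and say nothing about how FAGER actually behaves on real data.

```python
def score(image_attrs: set, rubric: set) -> float:
    """Fraction of rubric facts that the image depicts."""
    return len(image_attrs & rubric) / len(rubric) if rubric else 0.0


# Hypothetical facts proposed from the prompt alone.
proposed = {"f1", "f2", "f3", "f4"}

# The reference happens to miss f4; the generation misses f2.
reference = {"f1", "f2", "f3"}
generated = {"f1", "f3", "f4"}

# Reference-guided rubric: keep only proposed facts seen in the reference.
derived_rubric = proposed & reference

ref_on_derived = score(reference, derived_rubric)   # maximal by construction
gen_on_derived = score(generated, derived_rubric)

# An independent rubric (for example, checked against a knowledge base).
ref_on_independent = score(reference, proposed)
gen_on_independent = score(generated, proposed)

print(ref_on_derived, gen_on_derived)
print(ref_on_independent, gen_on_independent)
```

Under the reference-derived rubric the reference gets a perfect score and wins the comparison, while under the independent rubric the two images tie. The gap between those two outcomes is the portion of the A/B result that the test design itself supplies.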
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM-based fact proposal combined with reference-guided visual fact extraction produces a reliable structured factual rubric.
- domain assumption: VLM answers to the generated QA pairs accurately reflect visual factual correctness.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation."
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "We organize facts into three levels: Level 1: Object identity... Level 2: Key component verification... Level 3: Fine-grained detail verification."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat.equivNat
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "FAGER consistently outperforms prior metrics on the Factual A/B test"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.