pith. machine review for the scientific record.

arxiv: 2604.04172 · v1 · submitted 2026-04-05 · 💻 cs.CV · cs.AI

Recognition: no theorem link

GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models


Pith reviewed 2026-05-13 16:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords GENFIG1 · benchmark · vision-language models · figure generation · scientific visualization · generative AI · visual summaries · multimodal reasoning

The pith

GENFIG1 benchmark shows vision-language models struggle to generate figures summarizing a paper's core idea from text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GENFIG1, a benchmark that tests whether generative models can create the primary visual summary figure for a scientific paper using only its title, abstract, introduction, and figure caption. These figures must clearly express and motivate the central research idea, which demands that models first comprehend technical concepts, then select the most salient ones, and finally design a coherent graphic faithful to the input. The authors curate examples from top deep-learning conferences, add quality controls, and supply an automatic metric that aligns with expert human ratings. Evaluation of current models reveals that even the strongest systems fall short, exposing a gap between text understanding and effective visual synthesis. The benchmark is positioned as a foundation for advancing multimodal AI in scientific communication.
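To make the task interface concrete, here is a minimal sketch of how a GENFIG1-style input could be assembled and handed to a model under evaluation. Only the four input fields come from the paper; the class and function names are hypothetical stand-ins, not the benchmark's released code.

```python
# Minimal sketch of the GENFIG1 task interface. The four text fields mirror
# the inputs the benchmark supplies; `generate_figure` and `generate_image`
# are hypothetical stand-ins for whatever text-to-image API a system exposes.
from dataclasses import dataclass

@dataclass
class GenFig1Example:
    title: str
    abstract: str
    introduction: str
    figure_caption: str

    def to_prompt(self) -> str:
        # Fold the four fields into a single conditioning prompt.
        return (
            f"Title: {self.title}\n"
            f"Abstract: {self.abstract}\n"
            f"Introduction: {self.introduction}\n"
            f"Target figure caption: {self.figure_caption}\n"
            "Task: generate a 'Figure 1' that clearly expresses and "
            "motivates the paper's central idea."
        )

def generate_figure(model, example: GenFig1Example):
    # Returns an image object to be scored by the benchmark's metric.
    return model.generate_image(example.to_prompt())
```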

Core claim

We introduce GENFIG1, a benchmark for generative AI models to produce figures that clearly express and motivate the central idea of a paper from its title, abstract, introduction, and figure caption. Solving GENFIG1 requires models to comprehend technical concepts, identify the most salient ones, and design a coherent graphic that conveys those concepts visually. We curate the benchmark from papers at top deep-learning conferences, apply quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. Evaluation of representative models demonstrates that the task presents significant challenges even for the best-performing systems.

What carries the argument

GENFIG1 benchmark, which measures the coupling of scientific understanding with visual synthesis by requiring generation of a 'Figure 1' summary graphic from paper text alone.

If this is right

  • Models must demonstrate comprehension of technical concepts directly from text input.
  • They must select salient ideas and translate them into aesthetically effective visuals.
  • Current systems fall short on this combined reasoning and synthesis task.
  • Progress on the benchmark could support automated tools for scientific visual communication.
  • The task serves as a measurable foundation for future multimodal model development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could motivate new training objectives that explicitly reward conceptual fidelity in generated figures.
  • It highlights a broader limitation in multimodal models when moving from abstract description to concrete visual explanation.
  • Extensions might test whether the same models perform better when given additional paper sections such as methods or results.
  • Success here could reduce the manual iteration scientists currently perform to produce effective summary figures.

Load-bearing premise

The curated papers and automatic evaluation metric accurately measure a model's ability to integrate scientific understanding with visual figure creation.

What would settle it

If human experts consistently judge generated figures as failing to convey the paper's central idea while the automatic metric assigns them high scores, this would undermine the benchmark's claimed correlation with human judgment.
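That check is straightforward to run once paired scores exist. A minimal sketch, assuming one automatic-metric score and one expert rating per generated figure; the numbers below are placeholders, not the paper's data:

```python
# Minimal sketch: quantify agreement between an automatic metric and
# expert ratings over the same set of generated figures. Requires scipy.
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired scores, one pair per generated figure (same order).
metric_scores = [0.82, 0.41, 0.67, 0.23, 0.90]
expert_ratings = [4.5, 2.0, 3.5, 1.5, 5.0]

r, r_p = pearsonr(metric_scores, expert_ratings)
rho, rho_p = spearmanr(metric_scores, expert_ratings)

print(f"Pearson r = {r:.3f} (p = {r_p:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3g})")

# The failure mode described above would show up as high metric scores on
# figures that experts rate poorly, i.e. low or negative correlation.
```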

Figures

Figures reproduced from arXiv: 2604.04172 by Alan Yuille, Daniel Khashabi, Jieneng Chen, Najim Dehak, Pristina Wang, Yaohan Guan.

Figure 1: Examples from GENFIG1 (the first row from (Liu et al., 2024) and the second row from (Wu et al., 2022)). The task is to produce figures that clearly express and motivate the central idea of a paper (title, abstract, introduction, and figure caption) as input. Example responses from models we evaluate are shown in the middle column. Solving GENFIG1 requires more than just visually appealing graphics: the task en…
Figure 2: Resulting taxonomy of Figure 1s. We define three taxonomies (Overview, Example, and Experimental Results) and multiple sub-taxonomies, where Overview and Example contribute more. Among them, Example–Background and Example–Method are the most frequent, followed by Overview–Model Architecture and Overview–Method.
Figure 3: Figure 1 examples produced by both humans and models for all baselines from the paper.
Figure 4: Examples for taxonomy of Figure 1s. (a) Embeddings clustered by venues. (b) Embeddings clustered by fields.
Figure 5: UMAP visualizations of the paper representations: (a) clustered by venues and (b) clustered by research…
Figure 6: Prompt for GPT-4.1 as a judge.
Figure 7: Prompt for Text-Rich Catastrophic Neglect Score.
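Figures 6 and 7 reproduce judge prompts. As a rough illustration of how a GPT-4.1-as-judge call could be wired up, here is a minimal sketch assuming the OpenAI Python client and a figure reachable by URL; the rubric text, score scale, and helper name are assumptions, and the paper's actual judge prompt is the one shown in Figure 6.

```python
# Minimal sketch of an LLM-as-judge call for one generated figure, assuming
# the OpenAI Python client. The rubric wording and 1-5 scale are
# illustrative, not the paper's judge prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_figure(paper_summary: str, figure_url: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "You are judging whether a generated 'Figure 1' "
                             "clearly expresses and motivates the central "
                             "idea of the paper summarized below. Reply with "
                             "a score from 1 (fails) to 5 (excellent) and "
                             "one sentence of justification.\n\n"
                             + paper_summary},
                    {"type": "image_url", "image_url": {"url": figure_url}},
                ],
            }
        ],
    )
    # Return the judge's raw text verdict for downstream parsing.
    return response.choices[0].message.content
```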
Original abstract

In many science papers, "Figure 1" serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, often requiring significant effort and iteration by human authors to get right, highlighting the difficulty of science visual communication. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). GENFIG1 evaluates models for their ability to produce figures that clearly express and motivate the central idea of a paper (title, abstract, introduction, and figure caption) as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails reasoning for text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend and grasp the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and is faithful to the input. We curate the benchmark from papers published at top deep-learning conferences, apply stringent quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. We evaluate a suite of representative models on GENFIG1 and demonstrate that the task presents significant challenges, even for the best-performing systems. We hope this benchmark serves as a foundation for future progress in multimodal AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GENFIG1, a benchmark for vision-language models to generate Figure 1-style visual summaries that express the central research idea of a paper, taking as input the title, abstract, introduction, and figure caption. The benchmark is curated from papers at top deep-learning conferences with stringent quality control; an automatic evaluation metric is introduced that is claimed to correlate with expert human judgments; and evaluations of representative models are reported to show that the task remains significantly challenging even for the strongest current systems.

Significance. If the curation process and automatic metric prove reliable, GENFIG1 would offer a targeted test of whether VLMs can couple technical comprehension with visual synthesis for scientific communication, an increasingly relevant capability as multimodal models are deployed in research settings. The focus on conceptual fidelity rather than generic image quality is a constructive framing, and the benchmark could usefully complement existing text-to-image evaluations.

major comments (2)
  1. [Evaluation section] The automatic evaluation metric is asserted to correlate well with expert human judgments, yet no construction details, feature set, training procedure, correlation coefficient (Pearson r or Spearman ρ), or held-out validation statistics are supplied. This omission is load-bearing because the central claim that current models face significant challenges rests entirely on scores produced by this metric.
  2. [Benchmark construction] Dataset statistics are not reported: the total number of papers retained after curation, the precise exclusion criteria applied during quality control, and the distribution across conferences or subfields are absent. Without these quantities it is impossible to judge the benchmark's scale, diversity, or potential selection biases that could affect the reported model rankings (a sketch of the requested breakdown follows these comments).
minor comments (1)
  1. [Abstract] The abstract refers to 'top deep-learning conferences' without naming them; an explicit list (e.g., NeurIPS, ICML, CVPR, ICLR) would improve reproducibility and context.
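To illustrate the breakdown requested in major comment 2, a minimal sketch assuming a per-paper metadata table; the file name and column names are hypothetical, not artifacts released with the paper:

```python
# Minimal sketch: the dataset statistics requested in major comment 2,
# computed from a hypothetical per-paper metadata table. The columns
# ("venue", "subfield", "excluded_reason") are assumptions for illustration.
import pandas as pd

papers = pd.read_csv("genfig1_metadata.csv")  # hypothetical file

retained = papers[papers["excluded_reason"].isna()]
print("Total papers retained:", len(retained))
print("\nExclusion criteria counts:")
print(papers["excluded_reason"].value_counts(dropna=True))
print("\nDistribution across venues and subfields:")
print(retained.groupby(["venue", "subfield"]).size().unstack(fill_value=0))
```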

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below and will revise the paper accordingly to improve transparency and completeness.

Point-by-point responses
  1. Referee: [Evaluation section] The automatic evaluation metric is asserted to correlate well with expert human judgments, yet no construction details, feature set, training procedure, correlation coefficient (Pearson r or Spearman ρ), or held-out validation statistics are supplied. This omission is load-bearing because the central claim that current models face significant challenges rests entirely on scores produced by this metric.

    Authors: We acknowledge that the manuscript does not supply the requested construction details for the automatic evaluation metric. In the revised version we will add a dedicated subsection describing the metric's feature set, training procedure, exact correlation coefficients (Pearson r and Spearman ρ) with expert judgments, and held-out validation statistics. This addition will directly support the claim that current models remain challenged on GENFIG1. revision: yes

  2. Referee: [Benchmark construction] Dataset statistics are not reported: the total number of papers retained after curation, the precise exclusion criteria applied during quality control, and the distribution across conferences or subfields are absent. Without these quantities it is impossible to judge the benchmark's scale, diversity, or potential selection biases that could affect the reported model rankings.

    Authors: We agree that explicit dataset statistics are required to assess scale, diversity, and possible biases. The revised manuscript will include a new table and accompanying text reporting the final number of retained papers, the precise exclusion criteria applied during quality control, and the distribution of papers across conferences and subfields. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark curation and metric correlation are externally grounded

Full rationale

The paper introduces GENFIG1 by curating published papers from top conferences and defining an automatic metric asserted to correlate with human judgments. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The central claim rests on external paper selection and human correlation rather than reducing to self-definition or input-by-construction. Absence of metric construction details is a transparency gap but does not constitute circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a benchmark based on existing published papers without introducing new mathematical parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5558 in / 1036 out tokens · 38966 ms · 2026-05-13T16:41:18.626775+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 4 internal anchors

  1. [1]

    Jonas Belouadi, Anne Lauscher, and Steffen Eger. 2024. AutomaTikZ: Text-guided synthesis of scientific vector graphics with TikZ. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=v3K5TVP8kZ

  2. [2]

    Yifan Chang, Yukang Feng, Jianwen Sun, Jiaxin Ai, Chuanhao Li, S Kevin Zhou, and Kaipeng Zhang. 2025. Sridbench: Benchmark of scientific research illustration drawing of image generation model. arXiv preprint arXiv:2505.22126

  3. [3]

    Charles Chen, Ruiyi Zhang, Eunyee Koh, Sungchul Kim, Scott Cohen, Tong Yu, Ryan Rossi, and Razvan Bunescu. 2019. Figure captioning with reasoning and sequence-level training. Preprint, arXiv:1906.02850

  4. [4]

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Communications of the ACM, 63(11):139--144

  5. [5]

    Paul Grimal, Hervé Le Borgne, Olivier Ferret, and Julien Tourille. 2024. TIAM – a metric for evaluating alignment in text-to-image generation. Preprint, arXiv:2307.05134

  6. [6]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840--6851

  7. [7]

    Ting-Yao Hsu, C. Lee Giles, and Ting-Hao 'Kenneth' Huang. 2021. Scicap: Generating captions for scientific figures. Preprint, arXiv:2110.11624

  8. [8]

    Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, and Fei Huang. 2024. mPLUG-PaperOwl: Scientific diagram analysis with the multimodal large language model. Preprint, arXiv:2311.18248

  9. [9]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and others. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276

  10. [10]

    Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. 2023. Layoutdm: Discrete diffusion model for controllable layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10167--10176

  11. [11]

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. 2018. DVQA: Understanding data visualizations via question answering. Preprint, arXiv:1801.08163

  12. [12]

    Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Akos Kadar, Adam Trischler, and Yoshua Bengio. 2018. FigureQA: An annotated figure dataset for visual reasoning. Preprint, arXiv:1710.07300

  13. [13]

    Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. 2022. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems

  14. [14]

    Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. 2024. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. Preprint, arXiv:2403.00231

  15. [15]

    Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, Linda Ruth Petzold, Stephen D. Wilson, Woosang Lim, and William Yang Wang. 2025. MMSci: A dataset for graduate-level multi-discipline multimodal scientific understanding. Preprint, arXiv:2407.04903

  16. [16]

    Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. 2023. Aligning large language models with human preferences through representation engineering. ArXiv, abs/2312.15997. https://api.semanticscholar.org/CorpusID:266551232

  17. [17]

    Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. 2024. Aligning large language models with human preferences through representation engineering. In ACL (1)

  18. [18]

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461--11471

  19. [19]

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741

  20. [20]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, and others. 2023. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193

  21. [21]

    Pebble Authors. 2025. Pebble: Pebble provides a neat API to manage threads and processes within an application. https://github.com/noxdafox/pebble. GitHub repository, accessed 2025-08-01

  22. [22]

    plasTeX Development Team. 2025. plasTeX: A LaTeX compiler written in Python. https://github.com/plastex/plastex. GitHub repository, accessed 2025-08-01

  23. [23]

    PyLaTeX Contributors. 2025. PyLaTeX: A Python library for creating and compiling LaTeX files or snippets. https://github.com/JelteF/PyLaTeX. GitHub repository, accessed 2025-08-01

  24. [24]

    Jonathan Roberts, Kai Han, Neil Houlsby, and Samuel Albanie. 2024. SciFIBench: Benchmarking large multimodal models for scientific figure interpretation. Preprint, arXiv:2405.08807

  25. [26]

    Juan A. Rodriguez, Abhay Puri, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, Sai Rajeswar, David Vazquez, Christopher Pal, and Marco Pedersoli. 2024. StarVector: Generating scalable vector graphics code from images and text. Preprint, arXiv:2312.11556

  26. [27]

    Juan A. Rodriguez, David Vazquez, Issam Laradji, Marco Pedersoli, and Pau Rodriguez. 2022. OCR-VQGAN: Taming text-within-image generation. Preprint, arXiv:2210.11248

  27. [28]

    Juan A Rodriguez, David Vazquez, Issam Laradji, Marco Pedersoli, and Pau Rodriguez. 2023b. FigGen: Text to scientific figure generation. Preprint, arXiv:2306.00800

  28. [29]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684--10695

  29. [30]

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, and others. 2022a. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487

  30. [31]

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. 2022b. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence

  31. [32]

    Noah Siegel, Zachary Horvitz, Roie Levin, Santosh Kumar Divvala, and Ali Farhadi. 2016. FigureSeer: Parsing result-figures in research papers. In European Conference on Computer Vision. https://api.semanticscholar.org/CorpusID:7857660

  32. [33]

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, and others. 2023. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations

  33. [34]

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256--2265. PMLR

  34. [35]

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. 2024. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs. Preprint, arXiv:2406.18521

  35. [36]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

  36. [37]

    Bohong Wu, Zhuosheng Zhang, Jinyuan Wang, and Hai Zhao. 2022. Sentence-aware contrastive learning for open-domain passage retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1062--1074

  37. [38]

    Zhishen Yang, Raj Dabre, Hideki Tanaka, and Naoaki Okazaki. 2023. SciCap+: A knowledge augmented dataset to study the challenges of scientific figure captioning. Preprint, arXiv:2306.03491

  38. [39]

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836--3847

  39. [40]

    Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. 2023. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22490--22499

  40. [41]

    Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, and Jinsung Yoon. 2026a. Paperbanana: Automating academic illustration for AI scientists. Preprint, arXiv:2601.23265

  41. [42]

    Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, and Yue Zhang. 2026b. Autofigure: Generating and refining publication-ready scientific illustrations. arXiv preprint arXiv:2602.03828
