Any2Poster: Any-Source Poster Generation Across Modalities and Domains

Aiden Li; Amogh Vinaykumar; Shilong Liu; Suozhi Huang

arxiv: 2606.02915 · v1 · pith:54BAW5B4new · submitted 2026-06-01 · 💻 cs.CV

Any2Poster: Any-Source Poster Generation Across Modalities and Domains

Amogh Vinaykumar , Aiden Li , Suozhi Huang , Shilong Liu This is my paper

Pith reviewed 2026-06-28 14:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords poster generationmultimodal inputsbenchmark evaluationAI agentlayout planningvisual feedbackcontent condensationcross-domain generation

0 comments

The pith

A single agent generates posters from any input source across modalities and domains using parsing, planning, and visual feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Any2Poster Bench, a new evaluation resource that tests poster generation from eight input types including PDFs, videos, notebooks, and documents, across five content domains. It pairs each source with quiz questions that check factual retention and interpretive understanding, plus judgments of layout and visual quality. The authors also present Any2Poster Agent, an end-to-end system that reads the source, selects key content, designs the layout, renders the poster, and improves it through iterative visual checks. This combination moves evaluation beyond earlier paper-only or narrow-domain tests and supplies both a reusable benchmark and a working baseline system.

Core claim

Any2Poster Bench enables reproducible measurement of poster generation from heterogeneous sources by combining quiz probes for information fidelity with VLM assessments of visual communication; Any2Poster Agent demonstrates the feasibility of this approach by automatically handling parsing of diverse inputs, content organization, layout planning, rendering, and refinement, achieving consistent performance across modalities and domains.

What carries the argument

Any2Poster Agent, an end-to-end pipeline that parses heterogeneous sources, organizes salient content, plans layouts, renders posters, and refines them using visual feedback.

If this is right

Poster generation systems can be evaluated on non-text inputs such as videos and notebooks rather than papers alone.
Performance can be tracked separately for factual accuracy and for layout or readability quality.
Iterative visual feedback loops become a standard component in multimodal content creation pipelines.
Domain-general agents become testable baselines for tasks that require condensing dense information into visual form.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent structure could extend to generating other condensed visual formats such as infographics or slide decks from mixed sources.
Quiz-based evaluation could transfer to measuring retention from other AI-generated summaries or diagrams.
If the refinement step scales, it suggests a general pattern for agents that improve outputs through self-critique on visual tasks.

Load-bearing premise

VLM judgments of visual quality together with quiz probes provide a valid measure of information fidelity and visual communication quality.

What would settle it

A controlled human study in which viewers answer the benchmark quiz questions after seeing only the generated poster, compared against answers from the original source.

Figures

Figures reproduced from arXiv: 2606.02915 by Aiden Li, Amogh Vinaykumar, Shilong Liu, Suozhi Huang.

**Figure 2.** Figure 2: Overview of Any2Poster Bench. The benchmark covers eight input modalities and five content domains, and evaluates generated posters through BenchQuiz, a genre-aware multiple-choice reading-comprehension protocol answered by VLM readers. evaluating both quiz-based information recovery and VLM-as-judge visual communication quality across multiple modalities and content domains. 3 Any2Poster Bench We introduc… view at source ↗

**Figure 3.** Figure 3: Overview of Any2Poster Agent. The pipeline has three main components: (1) Unified Parsing, which converts heterogeneous inputs into a shared structured representation; (2) Contentadaptive Poster Planning, which analyzes source content, assigns panel roles and layout modes, and decides whether to preserve or generate visuals; and (3) HTML/CSS-based Rendering, which compiles an editable poster and applies V… view at source ↗

**Figure 4.** Figure 4: Qualitative examples comparing generated posters from different systems on the same [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

read the original abstract

Visual posters are a compact medium for communicating dense information, yet progress on automatic poster generation remains difficult to measure because existing evaluations are often restricted to paper-only inputs, narrow domains, or surface-level visual similarity. We introduce Any2Poster Bench, a benchmark for any-source poster generation that evaluates systems across eight input modalities--PDFs, URLs, PPTX, DOCX, Markdown, LaTeX, notebooks, and videos--and five content domains. Any2Poster Bench pairs each source with quiz-based probes of verbatim factual retention and interpretive understanding, together with VLM-based judgments of visual quality, layout, readability, content completeness, and logical flow, enabling reproducible assessment of both information fidelity and visual communication. To instantiate and validate this benchmark, we further present Any2Poster Agent, an end-to-end reference agent that parses heterogeneous sources, organizes salient content, plans poster layouts, renders posters, and iteratively refines them using visual feedback. On Any2Poster Bench, Any2Poster Agent achieves 87.25% average accuracy across input modalities and 87.28% across content domains. On PaperQuiz-style evaluation, where prior paper-to-poster agents are directly comparable, Any2Poster Agent improves over PosterAgent-4o from 51.06-51.33% to 72.58% overall accuracy and from 116-121 to 145.16 in density-augmented score. Together, Any2Poster Bench and Any2Poster Agent provide a reusable evaluation resource and a competitive baseline for studying multimodal, domain-general poster generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Any2Poster gives a new multi-modality benchmark and baseline agent for poster generation, but the VLM quality scores lack human validation so the headline numbers are hard to trust.

read the letter

The main things to know are the new benchmark spanning eight input types and the agent that posts clear gains on the paper-to-poster overlap task. Any2Poster Bench adds PDFs, URLs, PPTX, DOCX, Markdown, LaTeX, notebooks and videos plus five domains, with quiz probes for facts and VLM rubrics for layout and flow. The agent reaches 87.25% modality accuracy and lifts the prior baseline from 51% to 72.58% on the shared quiz metric.

What the paper does well is define a single evaluation protocol that covers more source formats than earlier paper-only setups. The end-to-end agent description covers parsing, content selection, layout planning, rendering and visual feedback loops, and the numbers are reported against a named prior system. That gives a concrete starting point for anyone tracking progress in this narrow slice of multimodal summarization.

The soft spot is the evaluation itself. The visual quality, readability and logical flow scores come from VLM judgments with no reported human correlation, inter-rater numbers or ablation against actual viewer comprehension. The stress-test note is correct here: if the VLM favors the agent's style choices but people do not, the 87% and density-augmented scores do not show real communication gains. The factual quiz part is more straightforward, but the visual-communication claims rest on the unanchored signal.

This is for groups building tools that turn mixed documents or media into posters. Readers who need a standardized testbed for any-source generation will get usable numbers and a reusable benchmark definition. It is coherent on its own terms and the citation pattern to prior poster work is straightforward.

I would send it to peer review. The benchmark scope is new enough to justify referee time, provided the authors add human validation for the VLM metrics.

Referee Report

1 major / 0 minor

Summary. The paper introduces Any2Poster Bench, a benchmark for any-source poster generation across eight input modalities (PDFs, URLs, PPTX, DOCX, Markdown, LaTeX, notebooks, videos) and five content domains. The benchmark uses quiz-based probes for factual retention and interpretive understanding alongside VLM-based judgments of visual quality, layout, readability, content completeness, and logical flow. It also presents Any2Poster Agent, an end-to-end reference agent for parsing sources, organizing content, planning layouts, rendering, and iterative refinement, reporting 87.25% average accuracy across modalities, 87.28% across domains, and improvements over PosterAgent-4o (51.06-51.33% to 72.58% overall accuracy; 116-121 to 145.16 density-augmented score) on PaperQuiz-style evaluation.

Significance. If the VLM judgments prove reliable, the benchmark and agent would supply a standardized, reproducible resource for multimodal poster generation that addresses gaps in prior paper-only or narrow-domain evaluations, with the reported gains providing a competitive baseline.

major comments (1)

[§3–4] §3–4 (benchmark construction, as referenced in abstract): The central performance claims (87.25% modality accuracy, 72.58% PaperQuiz accuracy) rest on VLM-based judgments of visual quality/layout/readability/completeness/flow and quiz probes, yet no human correlation, inter-annotator agreement, or ablation linking VLM scores to viewer comprehension/preference is reported. This is load-bearing because the skeptic concern directly questions whether these automated signals validly measure poster effectiveness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on evaluation validity. We address the concern point-by-point below.

read point-by-point responses

Referee: [§3–4] §3–4 (benchmark construction, as referenced in abstract): The central performance claims (87.25% modality accuracy, 72.58% PaperQuiz accuracy) rest on VLM-based judgments of visual quality/layout/readability/completeness/flow and quiz probes, yet no human correlation, inter-annotator agreement, or ablation linking VLM scores to viewer comprehension/preference is reported. This is load-bearing because the skeptic concern directly questions whether these automated signals validly measure poster effectiveness.

Authors: The 87.25% modality and 72.58% PaperQuiz accuracy figures are computed exclusively from the quiz probes, which are manually authored factual and interpretive questions; these provide an objective, human-grounded signal of information retention that does not depend on VLM output. VLM judgments are used only for the separate visual-quality, layout, readability, completeness, and flow dimensions. We acknowledge that the manuscript does not report a new human correlation study or inter-annotator agreement for the VLM component. In revision we will add a dedicated paragraph in §4 discussing VLM reliability, citing prior work on VLM-human alignment for layout and visual tasks, and we will include a small-scale human preference correlation on a held-out subset of posters. revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark and performance claims are self-contained without reduction to fitted inputs or self-citations.

full rationale

The paper introduces Any2Poster Bench (quiz probes + VLM judgments) and Any2Poster Agent as a reference system, then reports accuracy numbers on that benchmark. No equations, fitted parameters, or derivation steps appear in the provided text. The central claims rest on external evaluation signals (quizzes and VLM rubrics) rather than any quantity that reduces by construction to the agent's own outputs or prior self-citations. Self-citation load-bearing, ansatz smuggling, and renaming patterns are absent. The evaluation methodology is stated explicitly and does not collapse into tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Only the abstract is available, so the ledger records the evaluation assumptions stated there. The central performance claims rest on the validity of the introduced benchmark and its metrics rather than on any derivation from prior equations.

axioms (2)

domain assumption VLM-based judgments reliably measure visual quality, layout, readability, content completeness, and logical flow
The benchmark construction in the abstract relies on these judgments for reproducible assessment.
domain assumption Quiz-based probes accurately capture verbatim factual retention and interpretive understanding
The abstract states that each source is paired with such probes to evaluate information fidelity.

invented entities (2)

Any2Poster Bench no independent evidence
purpose: Benchmark for any-source poster generation across modalities and domains
New evaluation resource introduced in the abstract.
Any2Poster Agent no independent evidence
purpose: End-to-end reference agent for parsing sources, planning layouts, rendering, and refining posters
New baseline system presented in the abstract.

pith-pipeline@v0.9.1-grok · 5821 in / 1572 out tokens · 32536 ms · 2026-06-28T14:52:40.593902+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 14 canonical work pages · 4 internal anchors

[1]

Croissant: A metadata format for ml-ready datasets, 2024

Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Luca Foschini, Joan Giner-Miguelez, et al. Croissant: A metadata format for ml-ready datasets, 2024

2024
[2]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE International Conference on Computer Vision, 2015

2015
[3]

Qwen2.5-VL technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2.5-VL technical report, 2025

2025
[4]

Enhancing presentation slide generation by llms with a multi-staged end-to-end approach

Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan, and Apoorv Sax- ena. Enhancing presentation slide generation by llms with a multi-staged end-to-end approach. InProceedings of the 17th International Natural Language Generation Conference, pages 222–229. Association for Computational Linguistics, 2024

2024
[5]

Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents.arXiv preprint arXiv:2308.13418, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

POSTA: A go-to framework for customized artistic poster generation

Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, and Xinchao Wang. POSTA: A go-to framework for customized artistic poster generation. arXiv preprint arXiv:2503.14908, 2025

work page arXiv 2025
[7]

Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities, 2025

2025
[8]

PAL: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. InProceedings of the 40th International Conference on Machine Learning, 2023

2023
[9]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

2021
[10]

Paper2Slides: From paper to presentation in one click

HKUDS. Paper2Slides: From paper to presentation in one click. https://github.com/ HKUDS/Paper2Slides, 2025. Software project

2025
[11]

Hudson and Christopher D

Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019

2019
[12]

Text2Poster: Laying out stylized texts on retrieved images

Chuhao Jin, Hongteng Xu, Ruihua Song, and Zhiwu Lu. Text2Poster: Laying out stylized texts on retrieved images. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 4823–4827. IEEE, 2022

2022
[13]

WebDraw: A machine learning-driven tool for automatic website prototyping.Science of Computer Programming, 233:103056, 2024

Thisaranie Kaluarachchi and Manjusri Wickramasinghe. WebDraw: A machine learning-driven tool for automatic website prototyping.Science of Computer Programming, 233:103056, 2024

2024
[14]

Relation-aware diffusion model for controllable poster layout generation.arXiv preprint arXiv:2306.09086, 2023

Fengheng Li, An Liu, Wei Feng, Honghe Zhu, Yaoyu Li, Zheng Zhang, Jingjing Lv, Xin Zhu, Junjie Shen, Zhangang Lin, and Jingping Shao. Relation-aware diffusion model for controllable poster layout generation.arXiv preprint arXiv:2306.09086, 2023

work page arXiv 2023
[15]

Planning and rendering: Towards product poster generation with diffusion models.arXiv preprint arXiv:2312.08822, 2024

Zhaochen Li, Fengheng Li, Wei Feng, Honghe Zhu, Yaoyu Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Zhangang Lin, Jingping Shao, and Zhenglu Yang. Planning and rendering: Towards product poster generation with diffusion models.arXiv preprint arXiv:2312.08822, 2024

work page arXiv 2024
[16]

Holistic evaluation of language models, 2022

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models, 2022

2022
[17]

LayoutPrompter: Awaken the design ability of large language models.arXiv preprint arXiv:2311.06495, 2023

Jiawei Lin, Jiaqi Guo, Shizhao Sun, Zijiang James Yang, Jian-Guang Lou, and Dongmei Zhang. LayoutPrompter: Awaken the design ability of large language models.arXiv preprint arXiv:2311.06495, 2023. 10

work page arXiv 2023
[18]

G-Eval: Nlg evaluation using gpt-4 with better human alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: Nlg evaluation using gpt-4 with better human alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023

2023
[19]

Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, and Peter W. J. Staar. Docling: An efficient open-source toolkit for ai-driven document convers...

work page arXiv 2025
[20]

OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. Octo- Tools: An agentic framework with extensible tools for complex reasoning.arXiv preprint arXiv:2502.11271, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Ui layout generation with llms guided by ui grammar.arXiv preprint arXiv:2310.15455, 2023

Yuwen Lu, Ziang Tong, Qinyi Zhao, Chengzhi Zhang, and Toby Jia-Jun Li. Ui layout generation with llms guided by ui grammar.arXiv preprint arXiv:2310.15455, 2023

work page arXiv 2023
[22]

GlyphDraw2: Automatic generation of complex glyph posters with diffusion models and large language models.arXiv preprint arXiv:2407.02252, 2024

Jian Ma, Yonglin Deng, Chen Chen, Nanyang Du, Haonan Lu, and Zhenyu Yang. GlyphDraw2: Automatic generation of complex glyph posters with diffusion models and large language models.arXiv preprint arXiv:2407.02252, 2024

work page arXiv 2024
[23]

Presentations by the humans and for the humans: Harnessing llms for generating persona-aware slides from documents

Ishani Mondal, Shwetha S, Anandhavelu Natarajan, Aparna Garimella, Sambaran Bandyopad- hyay, and Jordan Boyd-Graber. Presentations by the humans and for the humans: Harnessing llms for generating persona-aware slides from documents. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, pages 2664–26...

2024
[24]

GPT-4o system card

OpenAI. GPT-4o system card. https://openai.com/index/gpt-4o-system-card/ , 2024

2024
[25]

GPT-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

OpenAI. GPT-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

2025
[26]

Paper2Poster: Towards multimodal poster automation from scientific papers, 2025

Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, and Philip Torr. Paper2Poster: Towards multimodal poster automation from scientific papers, 2025

2025
[27]

ART: Automatic multi-step reasoning and tool-use for large language models

Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. ART: Automatic multi-step reasoning and tool-use for large language models.arXiv preprint arXiv:2303.09014, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

marker: Convert pdf to markdown and json quickly with high accuracy

Vik Paruchuri. marker: Convert pdf to markdown and json quickly with high accuracy. https: //github.com/VikParuchuri/marker, 2025. Software project

2025
[29]

Nassar, and Peter W

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter W. J. Staar. DocLayNet: A large human-annotated dataset for document-layout analysis. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3743–3751, 2022

2022
[30]

Data cards: Purposeful and trans- parent dataset documentation for responsible ai

Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data cards: Purposeful and trans- parent dataset documentation for responsible ai. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1776–1826, 2022

2022
[31]

Learning to generate posters of scientific papers by probabilistic graphical models, 2017

Yu-ting Qiang, Yanwei Fu, Xiao Yu, Yanwen Guo, Zhi-Hua Zhou, and Leonid Sigal. Learning to generate posters of scientific papers by probabilistic graphical models, 2017

2017
[32]

Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte Suresh, François Savard, Ahmed Masry, Shravan Nayak, et al

Juan A. Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte Suresh, François Savard, Ahmed Masry, Shravan Nayak, et al. BigDocs: An open dataset for training multimodal models on document and code tasks. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[33]

PosterSum: A multimodal benchmark for scientific poster summarization.arXiv preprint arXiv:2502.17540, 2025

Rohit Saxena, Pasquale Minervini, and Frank Keller. PosterSum: A multimodal benchmark for scientific poster summarization.arXiv preprint arXiv:2502.17540, 2025

work page arXiv 2025
[34]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.arXiv preprint arXiv:2302.04761, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

SlideGen: An abstractive section-based slide generator for scholarly documents

Athar Sefid, Prasenjit Mitra, and Lee Giles. SlideGen: An abstractive section-based slide generator for scholarly documents. InProceedings of the 21st ACM Symposium on Document Engineering. Association for Computing Machinery, 2021

2021
[36]

De- sign2Code: Benchmarking multimodal code generation for automated front-end engineering

Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. De- sign2Code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3956–3974. Association fo...

2025
[37]

Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, and Nancy X. R. Wang. D2S: Document-to-slide generation via query-based text summarization. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1405–1418. Association for Computational Linguistics, 2021

2021
[38]

P2P: Automated paper-to-poster generation and fine-grained benchmark, 2025

Tao Sun, Enhao Pan, Zhengkai Yang, Kaixin Sui, Jiajun Shi, Xianfu Cheng, Tongliang Li, Wenhao Huang, Ge Zhang, Jian Yang, and Zhoujun Li. P2P: Automated paper-to-poster generation and fine-grained benchmark, 2025

2025
[39]

PosterBot: A system for generating posters of scientific papers with neural models

Sheng Xu and Xiaojun Wan. PosterBot: A system for generating posters of scientific papers with neural models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 13233–13235, 2022

2022
[40]

Poster- LLaV A: Constructing a unified multi-modal layout generator with llm.arXiv preprint arXiv:2406.02884, 2024

Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan, and Chang Wen Chen. Poster- LLaV A: Constructing a unified multi-modal layout generator with llm.arXiv preprint arXiv:2406.02884, 2024

work page arXiv 2024
[41]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023

2023
[42]

PosterGen: Aesthetic-aware paper-to-poster generation via multi-agent llms, 2025

Zhilin Zhang, Xiang Zhang, Jiaqi Wei, Yiwei Xu, and Chenyu You. PosterGen: Aesthetic-aware paper-to-poster generation via multi-agent llms, 2025

2025
[43]

PPTAgent: Generating and evaluating presentations beyond text-to-slides.arXiv preprint arXiv:2501.03936, 2025

Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, and Le Sun. PPTAgent: Generating and evaluating presentations beyond text-to-slides.arXiv preprint arXiv:2501.03936, 2025

work page arXiv 2025
[44]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

2023
[45]

issues": [

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: Largest dataset ever for document layout analysis. In2019 International Conference on Document Analysis and Recognition, pages 1015–1022, 2019. 12 Appendix Contents ADataset Details .....................................................................................................................

2019

[1] [1]

Croissant: A metadata format for ml-ready datasets, 2024

Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Luca Foschini, Joan Giner-Miguelez, et al. Croissant: A metadata format for ml-ready datasets, 2024

2024

[2] [2]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE International Conference on Computer Vision, 2015

2015

[3] [3]

Qwen2.5-VL technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2.5-VL technical report, 2025

2025

[4] [4]

Enhancing presentation slide generation by llms with a multi-staged end-to-end approach

Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan, and Apoorv Sax- ena. Enhancing presentation slide generation by llms with a multi-staged end-to-end approach. InProceedings of the 17th International Natural Language Generation Conference, pages 222–229. Association for Computational Linguistics, 2024

2024

[5] [5]

Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents.arXiv preprint arXiv:2308.13418, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

POSTA: A go-to framework for customized artistic poster generation

Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, and Xinchao Wang. POSTA: A go-to framework for customized artistic poster generation. arXiv preprint arXiv:2503.14908, 2025

work page arXiv 2025

[7] [7]

Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities, 2025

2025

[8] [8]

PAL: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. InProceedings of the 40th International Conference on Machine Learning, 2023

2023

[9] [9]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

2021

[10] [10]

Paper2Slides: From paper to presentation in one click

HKUDS. Paper2Slides: From paper to presentation in one click. https://github.com/ HKUDS/Paper2Slides, 2025. Software project

2025

[11] [11]

Hudson and Christopher D

Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019

2019

[12] [12]

Text2Poster: Laying out stylized texts on retrieved images

Chuhao Jin, Hongteng Xu, Ruihua Song, and Zhiwu Lu. Text2Poster: Laying out stylized texts on retrieved images. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 4823–4827. IEEE, 2022

2022

[13] [13]

WebDraw: A machine learning-driven tool for automatic website prototyping.Science of Computer Programming, 233:103056, 2024

Thisaranie Kaluarachchi and Manjusri Wickramasinghe. WebDraw: A machine learning-driven tool for automatic website prototyping.Science of Computer Programming, 233:103056, 2024

2024

[14] [14]

Relation-aware diffusion model for controllable poster layout generation.arXiv preprint arXiv:2306.09086, 2023

Fengheng Li, An Liu, Wei Feng, Honghe Zhu, Yaoyu Li, Zheng Zhang, Jingjing Lv, Xin Zhu, Junjie Shen, Zhangang Lin, and Jingping Shao. Relation-aware diffusion model for controllable poster layout generation.arXiv preprint arXiv:2306.09086, 2023

work page arXiv 2023

[15] [15]

Planning and rendering: Towards product poster generation with diffusion models.arXiv preprint arXiv:2312.08822, 2024

Zhaochen Li, Fengheng Li, Wei Feng, Honghe Zhu, Yaoyu Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Zhangang Lin, Jingping Shao, and Zhenglu Yang. Planning and rendering: Towards product poster generation with diffusion models.arXiv preprint arXiv:2312.08822, 2024

work page arXiv 2024

[16] [16]

Holistic evaluation of language models, 2022

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models, 2022

2022

[17] [17]

LayoutPrompter: Awaken the design ability of large language models.arXiv preprint arXiv:2311.06495, 2023

Jiawei Lin, Jiaqi Guo, Shizhao Sun, Zijiang James Yang, Jian-Guang Lou, and Dongmei Zhang. LayoutPrompter: Awaken the design ability of large language models.arXiv preprint arXiv:2311.06495, 2023. 10

work page arXiv 2023

[18] [18]

G-Eval: Nlg evaluation using gpt-4 with better human alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: Nlg evaluation using gpt-4 with better human alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023

2023

[19] [19]

Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, and Peter W. J. Staar. Docling: An efficient open-source toolkit for ai-driven document convers...

work page arXiv 2025

[20] [20]

OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. Octo- Tools: An agentic framework with extensible tools for complex reasoning.arXiv preprint arXiv:2502.11271, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Ui layout generation with llms guided by ui grammar.arXiv preprint arXiv:2310.15455, 2023

Yuwen Lu, Ziang Tong, Qinyi Zhao, Chengzhi Zhang, and Toby Jia-Jun Li. Ui layout generation with llms guided by ui grammar.arXiv preprint arXiv:2310.15455, 2023

work page arXiv 2023

[22] [22]

GlyphDraw2: Automatic generation of complex glyph posters with diffusion models and large language models.arXiv preprint arXiv:2407.02252, 2024

Jian Ma, Yonglin Deng, Chen Chen, Nanyang Du, Haonan Lu, and Zhenyu Yang. GlyphDraw2: Automatic generation of complex glyph posters with diffusion models and large language models.arXiv preprint arXiv:2407.02252, 2024

work page arXiv 2024

[23] [23]

Presentations by the humans and for the humans: Harnessing llms for generating persona-aware slides from documents

Ishani Mondal, Shwetha S, Anandhavelu Natarajan, Aparna Garimella, Sambaran Bandyopad- hyay, and Jordan Boyd-Graber. Presentations by the humans and for the humans: Harnessing llms for generating persona-aware slides from documents. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, pages 2664–26...

2024

[24] [24]

GPT-4o system card

OpenAI. GPT-4o system card. https://openai.com/index/gpt-4o-system-card/ , 2024

2024

[25] [25]

GPT-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

OpenAI. GPT-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

2025

[26] [26]

Paper2Poster: Towards multimodal poster automation from scientific papers, 2025

Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, and Philip Torr. Paper2Poster: Towards multimodal poster automation from scientific papers, 2025

2025

[27] [27]

ART: Automatic multi-step reasoning and tool-use for large language models

Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. ART: Automatic multi-step reasoning and tool-use for large language models.arXiv preprint arXiv:2303.09014, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

marker: Convert pdf to markdown and json quickly with high accuracy

Vik Paruchuri. marker: Convert pdf to markdown and json quickly with high accuracy. https: //github.com/VikParuchuri/marker, 2025. Software project

2025

[29] [29]

Nassar, and Peter W

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter W. J. Staar. DocLayNet: A large human-annotated dataset for document-layout analysis. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3743–3751, 2022

2022

[30] [30]

Data cards: Purposeful and trans- parent dataset documentation for responsible ai

Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data cards: Purposeful and trans- parent dataset documentation for responsible ai. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1776–1826, 2022

2022

[31] [31]

Learning to generate posters of scientific papers by probabilistic graphical models, 2017

Yu-ting Qiang, Yanwei Fu, Xiao Yu, Yanwen Guo, Zhi-Hua Zhou, and Leonid Sigal. Learning to generate posters of scientific papers by probabilistic graphical models, 2017

2017

[32] [32]

Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte Suresh, François Savard, Ahmed Masry, Shravan Nayak, et al

Juan A. Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte Suresh, François Savard, Ahmed Masry, Shravan Nayak, et al. BigDocs: An open dataset for training multimodal models on document and code tasks. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[33] [33]

PosterSum: A multimodal benchmark for scientific poster summarization.arXiv preprint arXiv:2502.17540, 2025

Rohit Saxena, Pasquale Minervini, and Frank Keller. PosterSum: A multimodal benchmark for scientific poster summarization.arXiv preprint arXiv:2502.17540, 2025

work page arXiv 2025

[34] [34]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.arXiv preprint arXiv:2302.04761, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

SlideGen: An abstractive section-based slide generator for scholarly documents

Athar Sefid, Prasenjit Mitra, and Lee Giles. SlideGen: An abstractive section-based slide generator for scholarly documents. InProceedings of the 21st ACM Symposium on Document Engineering. Association for Computing Machinery, 2021

2021

[36] [36]

De- sign2Code: Benchmarking multimodal code generation for automated front-end engineering

Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. De- sign2Code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3956–3974. Association fo...

2025

[37] [37]

Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, and Nancy X. R. Wang. D2S: Document-to-slide generation via query-based text summarization. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1405–1418. Association for Computational Linguistics, 2021

2021

[38] [38]

P2P: Automated paper-to-poster generation and fine-grained benchmark, 2025

Tao Sun, Enhao Pan, Zhengkai Yang, Kaixin Sui, Jiajun Shi, Xianfu Cheng, Tongliang Li, Wenhao Huang, Ge Zhang, Jian Yang, and Zhoujun Li. P2P: Automated paper-to-poster generation and fine-grained benchmark, 2025

2025

[39] [39]

PosterBot: A system for generating posters of scientific papers with neural models

Sheng Xu and Xiaojun Wan. PosterBot: A system for generating posters of scientific papers with neural models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 13233–13235, 2022

2022

[40] [40]

Poster- LLaV A: Constructing a unified multi-modal layout generator with llm.arXiv preprint arXiv:2406.02884, 2024

Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan, and Chang Wen Chen. Poster- LLaV A: Constructing a unified multi-modal layout generator with llm.arXiv preprint arXiv:2406.02884, 2024

work page arXiv 2024

[41] [41]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023

2023

[42] [42]

PosterGen: Aesthetic-aware paper-to-poster generation via multi-agent llms, 2025

Zhilin Zhang, Xiang Zhang, Jiaqi Wei, Yiwei Xu, and Chenyu You. PosterGen: Aesthetic-aware paper-to-poster generation via multi-agent llms, 2025

2025

[43] [43]

PPTAgent: Generating and evaluating presentations beyond text-to-slides.arXiv preprint arXiv:2501.03936, 2025

Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, and Le Sun. PPTAgent: Generating and evaluating presentations beyond text-to-slides.arXiv preprint arXiv:2501.03936, 2025

work page arXiv 2025

[44] [44]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

2023

[45] [45]

issues": [

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: Largest dataset ever for document layout analysis. In2019 International Conference on Document Analysis and Recognition, pages 1015–1022, 2019. 12 Appendix Contents ADataset Details .....................................................................................................................

2019