UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings

Bo Zhao; Chen Zhang; Huan Yang; Maosheng Pang; Wei Ji; Yixin Cao

arxiv: 2605.17356 · v1 · pith:RIRXGW63new · submitted 2026-05-17 · 💻 cs.CV

Pith reviewed 2026-05-20 13:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords presentation generation · benchmark · evaluation protocol · multimodal inputs · document summarization · content grounding · cross-source synthesis

The pith

A unified benchmark tests presentation generation across vague prompts, long documents, multimodal inputs, and multiple sources with both shared and scenario-specific metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates UniPPTBench to cover four real-world input settings that existing work had studied only in isolation. It pairs the benchmark with UniPPTEval, which adds scenario-specific checks for grounded compression, visual-text alignment, and cross-source synthesis alongside general quality measures. Experiments on the benchmark show that models vary widely by setting and that high scores on visual appeal or coherence often fail to deliver strong performance on the core tasks each setting demands. The work supplies reference baselines and makes the data and code public to enable reproducible comparisons.

Core claim

UniPPTBench supplies a single testbed spanning vague-prompt, long-document, multimodal-document, and multi-source generation, while UniPPTEval combines cross-setting metrics with setting-specific ones; experiments on this platform demonstrate large performance differences across scenarios and show that generic presentation-quality scores do not reliably indicate success at content grounding, multimodal integration, or cross-source synthesis.

What carries the argument

UniPPTBench, the benchmark that unifies four representative input settings, together with UniPPTEval, the evaluation protocol that mixes shared metrics for comparison with scenario-specific metrics for each setting's core requirements.

If this is right

Strong generic presentation quality does not guarantee success at grounded compression from long documents.
Visual-text alignment must be measured separately when inputs include images or charts.
Cross-source synthesis becomes a distinct failure mode when material arrives from heterogeneous origins.
Future systems will need training objectives that target the scenario-specific capabilities rather than generic appeal alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A similar unified benchmark approach could be applied to other generative tasks such as report or slide-deck creation from mixed data.
Developers might use the scenario-specific metrics as auxiliary losses during model training to improve grounding without sacrificing overall layout quality.
The public release of the benchmark data enables direct comparison of new methods against the provided baselines on the same inputs.

Load-bearing premise

The four selected input settings and their associated scenario-specific metrics accurately capture the essential demands of real-world presentation generation.

What would settle it

If every system that scores highest on generic visual and coherence metrics also ranks highest on the scenario-specific grounding and synthesis metrics across all four settings, the claim that separate evaluations are needed would be weakened.
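
One way to run this check is to compare how systems rank under a generic metric and under a scenario-specific metric, using a Spearman rank correlation. The sketch below is illustrative only: the function names, the five placeholder systems, and every score are hypothetical and do not come from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least two items")
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five placeholder systems in one input setting.
generic_scores = [0.9, 0.8, 0.7, 0.6, 0.5]    # e.g. visual appeal or coherence
grounding_scores = [0.5, 0.9, 0.6, 0.8, 0.7]  # e.g. a scenario-specific metric

rho = spearman(generic_scores, grounding_scores)
print(f"Spearman rho between generic and scenario-specific rankings: {rho:.2f}")
```

A coefficient close to 1 in all four settings would point toward the outcome described above, while values near zero or below would be consistent with the paper's claim that the two kinds of metric measure different things.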

Figures

Figures reproduced from arXiv: 2605.17356 by Bo Zhao, Chen Zhang, Huan Yang, Maosheng Pang, Wei Ji, Yixin Cao.

**Figure 1.** Motivation and overview of our framework. (a) Prior work usually studies presentation generation in isolated settings, such as prompt-only or PDF…

**Figure 2.** Overview of UniPPTEval. Candidate shared metrics are normalized, aligned with human preferences, filtered through orthogonality and efficiency…

**Figure 3.** Comparison between traditional MLLM scoring (left) and our…

**Figure 4.** Overall architecture of the proposed multi-agent pipeline. The framework consists of three core components: (1) the Narrative Agent, which performs deep research to transform user instructions and reference documents into structured outlines and page descriptions; (2) the Style Agent, which induces a document-level style contract from a parameterized schema; and (3) the Visual Design Agent, which translate…

**Figure 5.** Qualitative comparison of representative PPT generation systems under a Multi-Modal input setting. Despite using the same input materials and…

**Figure 6.** Qualitative comparison of representative PPT generation systems under the Long-Doc input setting. It can be observed that most errors primarily arise…

**Figure 6.** Consequently, future improvements must transition…

Original abstract

Existing works typically focus on presentation generation under isolated input settings, whereas real-world use cases span diverse scenarios, including vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources. Moreover, current evaluations are often insufficiently scenario-specific. They mainly rely on generic presentation-quality criteria, such as visual appeal, layout quality, and overall coherence, but fail to assess the core capabilities required by different input settings, including grounded compression, visual-text alignment, and cross-source synthesis. Consequently, the field lacks a unified benchmark and a scenario-aware evaluation framework for faithfully diagnosing presentation-generation systems across diverse real-world settings. We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics for cross-setting comparison with scenario-specific metrics tailored to the core requirements of each setting. We also provide transparent reference baselines to support reproducible comparison. Experiments on UniPPTBench reveal substantial performance variation across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment in grounded scenarios. Together, UniPPTBench and UniPPTEval provide a faithful and diagnostic foundation for evaluating presentation generation across diverse real-world scenarios. Code and data will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UniPPTBench unifies four presentation input settings with tailored metrics, but those metrics rest on unvalidated assumptions about real-world needs.

This paper's main contribution is pulling together a benchmark that spans vague prompts, long documents, multimodal inputs, and mixed sources, then adding metrics meant to match what each setting actually requires instead of generic quality scores. That unification fills a gap left by earlier work that handled one case at a time. The experiments are straightforward and show clear performance differences across the settings, plus the useful point that models can score well on visual appeal or coherence without delivering properly grounded or synthesized content. They also release baselines and plan to share code and data, which helps reproducibility.

The soft spot is the lack of grounding for the new metrics. The authors define things like grounded compression and cross-source synthesis, but the text does not report inter-rater checks, correlations with human task-success ratings, or tests showing these metrics track real presentation utility better than existing ones. Without that, the claim that the benchmark is diagnostic rests more on the authors' choices than on external evidence. The four settings themselves are reasonable but could reflect the authors' view of the problem rather than a broad survey of practitioner needs.

This is useful for people building or testing multimodal generation systems that aim at practical output like slides. Readers who care about benchmark design for applied tasks will find the setup and the failure-mode analysis worth looking at. It is coherent enough on its own terms to deserve a full referee process rather than a desk reject. I would send it to review, with the expectation that the authors will need to strengthen the validation of the scenario-specific metrics.

Referee Report

2 major / 2 minor

Summary. The paper introduces UniPPTBench, a unified benchmark for presentation generation under four input settings (vague-prompt, long-document, multimodal-document, and multi-source), together with UniPPTEval, a scenario-aware evaluation protocol that combines shared generic metrics with setting-specific ones (grounded compression, visual-text alignment, cross-source synthesis). Experiments across models demonstrate substantial performance variation across settings, recurring failure modes in grounding and integration, and that strong generic presentation-quality scores do not necessarily indicate strong task fulfillment. Transparent reference baselines are provided and code/data release is promised.

Significance. If the chosen input settings and scenario-specific metrics can be shown to align with real-world task requirements, the benchmark would offer a more diagnostic alternative to current generic evaluation practices and help surface model weaknesses in content grounding and multimodal synthesis. The explicit provision of baselines and planned public release of code and data are positive contributions to reproducibility. The work's impact would be greatest if the new metrics are validated against human judgments of task success.

major comments (2)

[§3] §3 (Benchmark Construction): The four input settings are asserted to be 'representative' of real-world scenarios, yet the manuscript provides no user study, task analysis, or external validation to justify their selection or the mapping to core requirements (e.g., grounded compression for long documents). This directly affects the central claim that UniPPTBench supplies a 'faithful' foundation.
[§4] §4 (UniPPTEval): The scenario-specific metrics are introduced without reported inter-rater reliability, correlation analysis against human task-success ratings, or ablation showing they better predict real-world utility than generic metrics. This is load-bearing for the experimental conclusion that generic metrics 'do not necessarily imply strong task fulfillment.'

minor comments (2)

[Abstract] Abstract: The statement that 'strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment' would be strengthened by a brief quantitative illustration or reference to a specific table/figure.
Ensure formal definitions or equations are supplied for all scenario-specific metrics to support exact reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and outline the revisions we will make to strengthen the justification of the benchmark settings and the validation of the evaluation metrics.

Referee: [§3] §3 (Benchmark Construction): The four input settings are asserted to be 'representative' of real-world scenarios, yet the manuscript provides no user study, task analysis, or external validation to justify their selection or the mapping to core requirements (e.g., grounded compression for long documents). This directly affects the central claim that UniPPTBench supplies a 'faithful' foundation.

Authors: We acknowledge that the manuscript does not present a dedicated user study or formal task analysis to empirically validate the representativeness of the four input settings. These settings were chosen to reflect frequently encountered real-world presentation generation scenarios documented in prior literature, including brief user prompts for quick slide creation, condensation of lengthy reports, integration of text with images or charts, and synthesis across heterogeneous documents. In the revised manuscript, we will expand §3 with additional references to existing surveys and use-case studies on presentation tools that support these as core scenarios. We will also provide a clearer mapping of each setting to its core requirements, such as detailing why grounded compression is necessary for long-document inputs to retain key factual content. This will better substantiate the claim of a faithful foundation without overstating current evidence.

revision: yes

Referee: [§4] §4 (UniPPTEval): The scenario-specific metrics are introduced without reported inter-rater reliability, correlation analysis against human task-success ratings, or ablation showing they better predict real-world utility than generic metrics. This is load-bearing for the experimental conclusion that generic metrics 'do not necessarily imply strong task fulfillment.'

Authors: We agree that the current presentation of the scenario-specific metrics would benefit from explicit validation. The metrics target distinct capabilities required by each input setting (e.g., measuring preservation of critical information under compression for long documents). The manuscript does not yet include inter-rater reliability statistics or direct correlation analyses with human task-success judgments. In the revision, we will add a human evaluation component with multiple annotators to report inter-rater agreement and examine correlations between the proposed metrics and human assessments of task fulfillment. We will also include a comparative analysis demonstrating where scenario-specific metrics provide diagnostic value beyond generic presentation-quality scores. These changes will reinforce the experimental observation that strong generic scores do not always correspond to strong performance on setting-specific requirements.

revision: yes
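
The inter-rater agreement mentioned in this exchange is often reported as Cohen's kappa when two annotators assign categorical labels. The sketch below shows that computation under stated assumptions: the paper does not say which agreement statistic it will use, and the function name, the pass/fail label set, and the annotations are all hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgments of task fulfillment for eight generated decks.
annotator_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
annotator_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is a more conservative figure than the simple match rate. With more than two annotators or ordinal scores, a different statistic (for example Fleiss' kappa or Krippendorff's alpha) would be the usual choice.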

Circularity Check

0 steps flagged

No circularity: benchmark and metrics defined independently without reduction to inputs

The paper introduces UniPPTBench and UniPPTEval as new constructs with four input settings and scenario-specific metrics (grounded compression, visual-text alignment, cross-source synthesis). These are presented as definitional choices to address gaps in existing work, without any equations, fitted parameters renamed as predictions, or self-citation chains that bear the central claim. The assertion that generic metrics fail to imply task fulfillment is supported by experimental observations on the new benchmark rather than by construction from prior inputs. No load-bearing step reduces to self-definition or imported uniqueness; the work is self-contained as a proposal of evaluation infrastructure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework depends on domain assumptions about what capabilities are core to each input setting and how to measure them.

axioms (1)

domain assumption: The four input settings are representative of diverse real-world scenarios for presentation generation.
Stated in the abstract as covering vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics... with scenario-specific metrics tailored to each setting’s core requirements.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

OpenAI GPT-5 System Card

A. Singhet al., “OpenAI GPT-5 system card,”arXiv preprint arXiv:2601.03267, 2025. 16

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

A survey on multimodal large language models,

S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,”National Science Review, vol. 11, no. 12, p. nwae403, 2024

work page 2024

[3] [3]

ReAct: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

work page 2023

[4] [4]

A survey on large language model based autonomous agents,

L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Lin, W. X. Zhao, Z. Wei, and J.-R. Wen, “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024

work page 2024

[5] [5]

Large multimodal agents: A survey,

J. Xie, Z. Chen, R. Zhang, X. Wan, and G. Li, “Large multimodal agents: A survey,”Visual Intelligence, vol. 3, no. 1, p. 8, 2025

work page 2025

[6] J. Ge, Z. Z. Wang, X. Zhou, Y.-H. Peng, S. Subramanian, Q. Tan, M. Sap, A. Suhr, D. Fried, G. Neubig, and T. Darrell, “AutoPresent: Designing structured visuals from scratch,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 2902–2911.

[7] T.-J. Fu, W. Y. Wang, D. McDuff, and Y. Song, “DOC2PPT: Automatic presentation slides generation from scientific documents,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 634–642.

[8] H. Zheng, X. Guan, H. Kong, W. Zhang, J. Zheng, W. Zhou, H. Lin, Y. Lu, X. Han, and L. Sun, “PPTAgent: Generating and evaluating presentations beyond text-to-slides,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, 2025, pp. 14 402–14 418.

[9] Y. Yang, W. Jiang, Y. Wang, Y. Song, Y. Wang, and C. Zhang, “Auto-Slides: An interactive multi-agent system for creating and customizing research presentations,” arXiv preprint arXiv:2509.11062, 2025.

[10] X. Liang, X. Zhang, Y. Xu, S. Sun, and C. You, “SlideGen: Collaborative multimodal agents for scientific slide generation,” arXiv preprint arXiv:2512.04529, 2025.

[11] Moonshot AI, “Kimi slides: AI presentation creator,” https://kimi.moonshot.cn/slides, 2025.

[12] Google, “Generate presentations in the Gemini app,” https://blog.google/products-and-platforms/products/gemini/gemini-drop-october-2025/, 2025.

[13] Kunlun Tech, “Skywork AI slides super agent,” https://skywork.ai/agent/en/slides, 2025.

[14] Google, “NotebookLM: AI-powered slide decks,” https://blog.google/technology/google-labs/8-ways-to-make-the-most-out-of-slide-decks-in-notebooklm/, 2025.

[15] Manus AI, “Manus slides: One-click AI generation of professional slide presentations,” https://manus.im/playbook/slide-generator, 2025.

[16] X.-S. Chen, J. Zhu, P.-l. Li, H. Wang, S. Yang, and M.-H. Guo, “PresentBench: A fine-grained rubric-based benchmark for slide generation,” arXiv preprint arXiv:2603.07244, 2026.

[17] Y. Ming, S. Purushwalkam, S. Pandit, Z. Ke, X.-P. Nguyen, C. Xiong, and S. Joty, “FaithEval: Can your language model stay faithful to context, even if “the moon is made of marshmallows”,” in International Conference on Learning Representations (ICLR), 2025.

[18] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi, “FActScore: Fine-grained atomic evaluation of factual precision in long form text generation,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023, pp. 12 076–12 100.

[19] S. Bandyopadhyay, H. Maheshwari, A. Natarajan, and A. Saxena, “Enhancing presentation slide generation by llms with a multi-staged end-to-end approach,” arXiv preprint arXiv:2406.06556, 2024.

[20] H. Zheng, G. Mo, X. Yan, Q. Yuan, W. Zhang, X. Chen, Y. Lu, H. Lin, X. Han, and L. Sun, “DeepPresenter: Environment-grounded reflection for agentic presentation generation,” arXiv preprint arXiv:2602.22839, 2026.

[21] X. Xu, X. Xu, S. Chen, H. Chen, F. Zhang, and Y.-C. Chen, “PreGenie: An agentic framework for high-quality visual presentation generation,” in Findings of the Association for Computational Linguistics: EMNLP 2025, 2025, pp. 3045–3063.

[22] K. R. A. Kumar and S. Arunachalam, “Learning to present: Inverse specification rewards for agentic slide generation,” arXiv preprint arXiv:2603.16839, 2026.

[23] Google, “Gemini 3: Introducing the latest Gemini AI model from Google,” https://blog.google/products/gemini/gemini-3, 2025.

[24] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, 2023.

[25] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024.

[26] W. Kryscinski, B. McCann, C. Xiong, and R. Socher, “Evaluating the factual consistency of abstractive text summarization,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 9332–9346.

[27] A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque, “ChartQA: A benchmark for question answering about charts with visual and logical reasoning,” in Findings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 2263–2279.

[28] J. DeYoung, S. C. Martinez, I. J. Marshall, and B. C. Wallace, “Do multi-document summarization models synthesize?” Transactions of the Association for Computational Linguistics, vol. 12, pp. 1043–1062, 2024.

[29] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 24 824–24 837.

[30] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2024.

[31] N. Inoue, K. Kikuchi, E. Simo-Serra, M. Otani, and K. Yamaguchi, “LayoutDM: Discrete diffusion model for controllable layout generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 10 167–10 176.

[32] H. Y. Hsu, X. He, Y. Peng, H. Kong, and Q. Zhang, “PosterLayout: A new benchmark and approach for content-aware visual-textual presentation layout,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6018–6026.

[33] H. Talebi and P. Milanfar, “NIMA: Neural image assessment,” IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3998–4011, 2018.

[34] J. Hou, W. Lin, Y. Fang, H. Wu, C. Chen, L. Liao, and W. Liu, “Toward transparent deep image aesthetics assessment with tag-based content descriptors,” IEEE Transactions on Image Processing, vol. 34, pp. 3070–3085, 2025.

[35] L. Celona, M. Leonardi, P. Napoletano, and A. Rozza, “Composition and style attributes guided image aesthetic assessment,” IEEE Transactions on Image Processing, vol. 31, pp. 5009–5024, 2022.

[36] H. Zeng, Z. Cao, L. Zhang, and A. C. Bovik, “A unified probabilistic formulation of image aesthetic assessment,” IEEE Transactions on Image Processing, vol. 29, pp. 1548–1561, 2019.

[37] L. Li, H. Zhu, S. Zhao, G. Ding, and W. Lin, “Personality-assisted multi-task learning for generic and personalized image aesthetics assessment,” IEEE Transactions on Image Processing, vol. 29, pp. 3898–3910, 2020.

[38] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri et al., “Self-refine: Iterative refinement with self-feedback,” in Advances in Neural Information Processing Systems, vol. 36, 2023.

[39] C. Liu, Y. Yang, K. Zhou, Z. Zhang, Y. Fan, Y. Xie, P. Qi, and X. E. Wang, “Presenting a paper is an art: Self-improvement aesthetic agents for academic presentations,” arXiv preprint arXiv:2510.05571, 2025.

[40] Qwen, “Qwen3.6,” https://qwen.ai/blog?id=qwen3.6, 2026.

[41] J. Niu, Z. Liu, Z. Gu, B. Wang, L. Ouyang, Z. Zhao, T. Chu, T. He, F. Wu, Q. Zhang et al., “MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing,” arXiv preprint arXiv:2509.22186, 2025.

[42] DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang et al., “DeepSeek-V3 technical report,” arXiv preprint arXiv:2412.19437, 2024.

APPENDIX D
BIOGRAPHY SECTION

Bo Zhao is currently pursuing a doctoral degree at the Shanghai Innovation Institute. His current research interests include multimodal large language models and content...