arxiv: 2605.17356 · v1 · pith:RIRXGW63new · submitted 2026-05-17 · 💻 cs.CV

UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings

Pith reviewed 2026-05-20 13:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords presentation generation · benchmark · evaluation protocol · multimodal inputs · document summarization · content grounding · cross-source synthesis

The pith

A unified benchmark tests presentation generation across vague prompts, long documents, multimodal inputs, and multiple sources with both shared and scenario-specific metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates UniPPTBench to cover four real-world input settings that existing work had studied only in isolation. It pairs the benchmark with UniPPTEval, which adds scenario-specific checks for grounded compression, visual-text alignment, and cross-source synthesis alongside general quality measures. Experiments on the benchmark show that models vary widely by setting and that high scores on visual appeal or coherence often fail to deliver strong performance on the core tasks each setting demands. The work supplies reference baselines and makes the data and code public to enable reproducible comparisons.

Core claim

UniPPTBench supplies a single testbed spanning vague-prompt, long-document, multimodal-document, and multi-source generation, while UniPPTEval combines cross-setting metrics with setting-specific ones; experiments on this platform demonstrate large performance differences across scenarios and show that generic presentation-quality scores do not reliably indicate success at content grounding, multimodal integration, or cross-source synthesis.

What carries the argument

UniPPTBench, the benchmark that unifies four representative input settings, together with UniPPTEval, the evaluation protocol that mixes shared metrics for comparison with scenario-specific metrics for each setting's core requirements.
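
One way to picture the split between shared and scenario-specific metrics is a per-deck score record that reports the two groups separately. The sketch below is illustrative only and is not the paper's implementation: the metric identifiers are snake_case renderings of the criteria named on this page, while the setting-to-metric mapping, the `DeckScores` class, the [0, 1] score range, and the plain averaging are assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean

# Generic criteria named in the abstract; the identifiers are illustrative.
SHARED_METRICS = ("visual_appeal", "layout_quality", "overall_coherence")

# Hypothetical mapping from input setting to its scenario-specific metrics.
SCENARIO_METRICS = {
    "vague_prompt": (),  # this page names no dedicated metric for the setting
    "long_document": ("grounded_compression",),
    "multimodal_document": ("visual_text_alignment",),
    "multi_source": ("cross_source_synthesis",),
}


@dataclass
class DeckScores:
    """Scores for one generated deck; values are assumed to lie in [0, 1]."""
    setting: str
    scores: dict = field(default_factory=dict)  # metric name -> score

    def shared(self) -> float:
        """Average of the metrics that are comparable across all settings."""
        return mean(self.scores[m] for m in SHARED_METRICS)

    def scenario(self):
        """Average of the setting-specific metrics, or None if there are none."""
        names = SCENARIO_METRICS[self.setting]
        if not names:
            return None
        return mean(self.scores[m] for m in names)


deck = DeckScores(
    "long_document",
    {
        "visual_appeal": 0.9,
        "layout_quality": 0.8,
        "overall_coherence": 0.7,
        "grounded_compression": 0.4,
    },
)
print({"setting": deck.setting, "shared": deck.shared(), "scenario": deck.scenario()})
```

Keeping the two averages apart, rather than folding them into one number, is what lets a deck with a high shared score and a low scenario score show up as a distinct failure case.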

If this is right

  • Strong generic presentation quality does not guarantee success at grounded compression from long documents.
  • Visual-text alignment must be measured separately when inputs include images or charts.
  • Cross-source synthesis becomes a distinct failure mode when material arrives from heterogeneous origins.
  • Future systems will need training objectives that target the scenario-specific capabilities rather than generic appeal alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A similar unified benchmark approach could be applied to other generative tasks such as report or slide-deck creation from mixed data.
  • Developers might use the scenario-specific metrics as auxiliary losses during model training to improve grounding without sacrificing overall layout quality.
  • The public release of the benchmark data enables direct comparison of new methods against the provided baselines on the same inputs.
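
The auxiliary-loss idea in the second bullet can be written down as a weighted sum. This is a minimal sketch of that arithmetic only, under loud assumptions: the function name, the weights, and the choice of `1 - score` as a penalty are hypothetical, and a real training setup would need differentiable surrogates for the metrics, which this page does not describe.

```python
def combined_objective(base_loss, auxiliary_penalties, weights):
    """Return base_loss plus weight * penalty for every auxiliary term.

    auxiliary_penalties and weights are dicts keyed by (hypothetical) metric name.
    """
    missing = set(auxiliary_penalties) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return base_loss + sum(
        weights[name] * penalty for name, penalty in auxiliary_penalties.items()
    )


# Assumed example: a grounded-compression score of 0.4 becomes a penalty of 0.6.
penalties = {"grounded_compression": 1.0 - 0.4}
total = combined_objective(0.25, penalties, {"grounded_compression": 0.5})
print(total)
```

The weight controls the trade-off the bullet mentions: a larger value pushes harder on grounding at the possible expense of the base layout objective.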

Load-bearing premise

The four selected input settings and their associated scenario-specific metrics accurately capture the essential demands of real-world presentation generation.

What would settle it

If every system that scores highest on generic visual and coherence metrics also ranks highest on the scenario-specific grounding and synthesis metrics across all four settings, the claim that separate evaluations are needed would be weakened.
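
The test described above can be run as a ranking comparison per setting. The following sketch assumes each system has one generic score and one scenario-specific score; the system names and numbers are made up, and Spearman rank correlation is one reasonable choice of agreement measure rather than something the paper is known to use.

```python
def ranks(values):
    """Average ranks (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def same_winner(generic, specific):
    """True if the top system by generic score is also top by specific score."""
    return max(generic, key=generic.get) == max(specific, key=specific.get)


# Made-up scores for three hypothetical systems in one setting.
generic = {"A": 0.9, "B": 0.8, "C": 0.6}
specific = {"A": 0.3, "B": 0.7, "C": 0.5}
systems = sorted(generic)
rho = spearman([generic[s] for s in systems], [specific[s] for s in systems])
print(rho, same_winner(generic, specific))
```

If `same_winner` held in every setting and the correlations were close to 1, the case for separate scenario-specific evaluation would be weaker; disagreement in even one setting is evidence in its favor.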

Figures

Figures reproduced from arXiv: 2605.17356 by Bo Zhao, Chen Zhang, Huan Yang, Maosheng Pang, Wei Ji, Yixin Cao.

Figure 1: Motivation and overview of our framework. (a) Prior work usually studies presentation generation in isolated settings, such as prompt-only or PDF …
Figure 2: Overview of UniPPTEval. Candidate shared metrics are normalized, aligned with human preferences, filtered through orthogonality and efficiency …
Figure 3: Comparison between traditional MLLM scoring (left) and our …
Figure 4: Overall architecture of the proposed multi-agent pipeline. The framework consists of three core components: (1) the Narrative Agent, which performs deep research to transform user instructions and reference documents into structured outlines and page descriptions; (2) the Style Agent, which induces a document-level style contract from a parameterized schema; and (3) the Visual Design Agent, which translate…
Figure 5: Qualitative comparison of representative PPT generation systems under a Multi-Modal input setting. Despite using the same input materials and …
Figure 6: Qualitative comparison of representative PPT generation systems under the Long-Doc input setting. It can be observed that most errors primarily arise …

Original abstract

Existing works typically focus on presentation generation under isolated input settings, whereas real-world use cases span diverse scenarios, including vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources. Moreover, current evaluations are often insufficiently scenario-specific. They mainly rely on generic presentation-quality criteria, such as visual appeal, layout quality, and overall coherence, but fail to assess the core capabilities required by different input settings, including grounded compression, visual-text alignment, and cross-source synthesis. Consequently, the field lacks a unified benchmark and a scenario-aware evaluation framework for faithfully diagnosing presentation-generation systems across diverse real-world settings. We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics for cross-setting comparison with scenario-specific metrics tailored to the core requirements of each setting. We also provide transparent reference baselines to support reproducible comparison. Experiments on UniPPTBench reveal substantial performance variation across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment in grounded scenarios. Together, UniPPTBench and UniPPTEval provide a faithful and diagnostic foundation for evaluating presentation generation across diverse real-world scenarios. Code and data will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces UniPPTBench, a unified benchmark for presentation generation under four input settings (vague-prompt, long-document, multimodal-document, and multi-source), together with UniPPTEval, a scenario-aware evaluation protocol that combines shared generic metrics with setting-specific ones (grounded compression, visual-text alignment, cross-source synthesis). Experiments across models demonstrate substantial performance variation across settings, recurring failure modes in grounding and integration, and that strong generic presentation-quality scores do not necessarily indicate strong task fulfillment. Transparent reference baselines are provided and code/data release is promised.

Significance. If the chosen input settings and scenario-specific metrics can be shown to align with real-world task requirements, the benchmark would offer a more diagnostic alternative to current generic evaluation practices and help surface model weaknesses in content grounding and multimodal synthesis. The explicit provision of baselines and planned public release of code and data are positive contributions to reproducibility. The work's impact would be greatest if the new metrics are validated against human judgments of task success.

major comments (2)
  1. §3 (Benchmark Construction): The four input settings are asserted to be 'representative' of real-world scenarios, yet the manuscript provides no user study, task analysis, or external validation to justify their selection or the mapping to core requirements (e.g., grounded compression for long documents). This directly affects the central claim that UniPPTBench supplies a 'faithful' foundation.
  2. §4 (UniPPTEval): The scenario-specific metrics are introduced without reported inter-rater reliability, correlation analysis against human task-success ratings, or ablation showing they better predict real-world utility than generic metrics. This is load-bearing for the experimental conclusion that generic metrics 'do not necessarily imply strong task fulfillment.'

minor comments (2)
  1. Abstract: The statement that 'strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment' would be strengthened by a brief quantitative illustration or reference to a specific table/figure.
  2. Ensure formal definitions or equations are supplied for all scenario-specific metrics to support exact reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and outline the revisions we will make to strengthen the justification of the benchmark settings and the validation of the evaluation metrics.

Point-by-point responses
  1. Referee: §3 (Benchmark Construction): The four input settings are asserted to be 'representative' of real-world scenarios, yet the manuscript provides no user study, task analysis, or external validation to justify their selection or the mapping to core requirements (e.g., grounded compression for long documents). This directly affects the central claim that UniPPTBench supplies a 'faithful' foundation.

    Authors: We acknowledge that the manuscript does not present a dedicated user study or formal task analysis to empirically validate the representativeness of the four input settings. These settings were chosen to reflect frequently encountered real-world presentation generation scenarios documented in prior literature, including brief user prompts for quick slide creation, condensation of lengthy reports, integration of text with images or charts, and synthesis across heterogeneous documents. In the revised manuscript, we will expand §3 with additional references to existing surveys and use-case studies on presentation tools that support these as core scenarios. We will also provide a clearer mapping of each setting to its core requirements, such as detailing why grounded compression is necessary for long-document inputs to retain key factual content. This will better substantiate the claim of a faithful foundation without overstating current evidence.

    Revision: yes

  2. Referee: §4 (UniPPTEval): The scenario-specific metrics are introduced without reported inter-rater reliability, correlation analysis against human task-success ratings, or ablation showing they better predict real-world utility than generic metrics. This is load-bearing for the experimental conclusion that generic metrics 'do not necessarily imply strong task fulfillment.'

    Authors: We agree that the current presentation of the scenario-specific metrics would benefit from explicit validation. The metrics target distinct capabilities required by each input setting (e.g., measuring preservation of critical information under compression for long documents). The manuscript does not yet include inter-rater reliability statistics or direct correlation analyses with human task-success judgments. In the revision, we will add a human evaluation component with multiple annotators to report inter-rater agreement and examine correlations between the proposed metrics and human assessments of task fulfillment. We will also include a comparative analysis demonstrating where scenario-specific metrics provide diagnostic value beyond generic presentation-quality scores. These changes will reinforce the experimental observation that strong generic scores do not always correspond to strong performance on setting-specific requirements.

    Revision: yes
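
The inter-rater agreement mentioned in the second response is commonly reported with a chance-corrected statistic such as Cohen's kappa. The sketch below shows that calculation for two annotators; it is offered as an illustration of what such a report could contain, not as the authors' planned procedure, and the "pass"/"fail" task-fulfillment labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical judgments of whether six decks fulfil their task.
annotator_1 = ["pass", "pass", "fail", "fail", "pass", "fail"]
annotator_2 = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on four of six decks, but half of that agreement is expected by chance given their label frequencies, so kappa is well below the raw agreement rate. With more than two annotators, a multi-rater statistic would be needed instead.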

Circularity Check

0 steps flagged

No circularity: benchmark and metrics defined independently without reduction to inputs

Full rationale

The paper introduces UniPPTBench and UniPPTEval as new constructs with four input settings and scenario-specific metrics (grounded compression, visual-text alignment, cross-source synthesis). These are presented as definitional choices to address gaps in existing work, without any equations, fitted parameters renamed as predictions, or self-citation chains that bear the central claim. The assertion that generic metrics fail to imply task fulfillment is supported by experimental observations on the new benchmark rather than by construction from prior inputs. No load-bearing step reduces to self-definition or imported uniqueness; the work is self-contained as a proposal of evaluation infrastructure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation framework depends on domain assumptions about what capabilities are core to each input setting and how to measure them.

axioms (1)
  • domain assumption: The four input settings are representative of diverse real-world scenarios for presentation generation.
    Stated in the abstract as covering vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics... with scenario-specific metrics tailored to each setting’s core requirements.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
