pith. machine review for the scientific record.

arxiv: 2605.11363 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.CL

Recognition: no theorem link

PresentAgent-2: Towards Generalist Multimodal Presentation Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:25 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords presentation video generation · multimodal agents · query-driven presentations · research grounding · slide construction · dialogue generation · interactive presentations · multimodal resources

The pith

PresentAgent-2 generates full presentation videos from open user queries by researching multimodal sources and composing slides, scripts, and media.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PresentAgent-2 is an agentic framework that takes an open-ended user query and a selected presentation mode to produce a complete video. It summarizes the query, researches presentation-friendly sources for text, images, GIFs, and videos, then builds slides and generates mode-specific scripts. The system assembles these elements with audio and dynamic media into finished videos. It handles three modes in one framework: single-speaker narration, multi-speaker discussion with defined roles, and independent interaction for answering questions grounded in the content. A new benchmark tests the outputs on content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding.
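As a reading aid, here is a minimal Python sketch of the pipeline as described: query summarization, deep research over multimodal sources, slide construction, mode-specific scripting, and composition. Every name (summarize_query, research_sources, Mode, and so on) is hypothetical and stands in for components the paper does not expose; the stub bodies are placeholders, not the framework's logic.

```python
# Illustrative sketch only; all names and stub bodies are hypothetical.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Mode(Enum):
    SINGLE = "single_presentation"   # single-speaker narration
    DISCUSSION = "discussion"        # multi-speaker dialogue with roles
    INTERACTION = "interaction"      # grounded audience Q&A


@dataclass
class Resources:
    texts: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    gifs: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)


def summarize_query(query: str) -> str:
    # Placeholder: an LLM would distill the open-ended query into a focused topic.
    return query.strip()


def research_sources(topic: str) -> Resources:
    # Placeholder: deep research over presentation-friendly sources.
    return Resources(texts=[f"notes on {topic}"])


def build_slides(topic: str, res: Resources) -> List[str]:
    # Placeholder: slide construction from the topic and collected resources.
    return [f"Slide 1: {topic}"] + [f"Slide: {t}" for t in res.texts]


def write_script(slides: List[str], res: Resources, mode: Mode) -> List[str]:
    # Placeholder: mode-specific narration, dialogue turns, or Q&A grounding.
    return [f"[{mode.value}] {s}" for s in slides]


def compose_video(slides: List[str], script: List[str], res: Resources) -> str:
    # Placeholder: audio synthesis plus assembly of slides, audio, and dynamic media.
    return "presentation.mp4"


def generate_presentation_video(query: str, mode: Mode) -> str:
    """Query -> topic -> research -> slides -> script -> composed video (path)."""
    topic = summarize_query(query)
    resources = research_sources(topic)
    slides = build_slides(topic, resources)
    script = write_script(slides, resources, mode)
    return compose_video(slides, script, resources)
```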

Core claim

PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction.

What carries the argument

PresentAgent-2, the agentic framework that summarizes queries, researches multimodal resources, constructs slides and mode-specific scripts, and composes complete videos.

If this is right

  • Generates single-speaker narrated presentation videos directly from queries.
  • Creates multi-speaker discussions with structured roles for questions, explanations, and summaries.
  • Supports a separate interaction mode for answering audience questions grounded in slides, scripts, and evidence (all three modes are sketched in code after this list).
  • Provides task-specific evaluation criteria in a multimodal benchmark covering content, media, dialogue, and interaction quality.
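A hypothetical schema for the three modes, inferred from the abstract's description of speaker roles and grounding sources; the class names, role names, and duties are illustrative, not the paper's.

```python
# Hypothetical mode configuration; the role duties follow the abstract's wording,
# but the schema itself is an assumption.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeakerRole:
    name: str
    duty: str  # what this speaker contributes


@dataclass
class ModeConfig:
    mode: str
    speakers: List[SpeakerRole]
    grounding: Optional[List[str]] = None  # sources answers must draw on (Interaction)


SINGLE = ModeConfig("single_presentation",
                    speakers=[SpeakerRole("narrator", "present all slides")])

DISCUSSION = ModeConfig("discussion", speakers=[
    SpeakerRole("host", "ask guiding questions"),
    SpeakerRole("expert", "explain concepts"),
    SpeakerRole("analyst", "clarify details"),
    SpeakerRole("host", "summarize key points"),
])

INTERACTION = ModeConfig("interaction",
                         speakers=[SpeakerRole("presenter", "answer audience questions")],
                         grounding=["slides", "scripts", "retrieved evidence",
                                    "presentation context"])
```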

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The framework could support on-demand generation of educational or training videos across many topics.
  • Extending research sources to include real-time data might enable presentations that stay current.
  • The same query-to-video pipeline may apply to other formats such as marketing pitches or internal briefings.

Load-bearing premise

Automated research over presentation-friendly sources will reliably yield coherent, high-quality multimodal resources that compose into natural videos without major factual errors or integration failures.

What would settle it

Benchmark evaluations that show frequent factual inaccuracies, low media relevance scores, or unnatural dialogue and integration in the generated videos would undermine the central claim.

Figures

Figures reproduced from arXiv: 2605.11363 by Hao Tang, Wei Wu, Yang Zhao, Zeyu Zhang, Ziyang Xu.

Figure 1
Figure 1. Representative frames from a generated PresentAgent-2 presentation video. The frames are sampled from different timestamps of the same video, showing how retrieved video evidence is incorporated into the generated presentation. view at source ↗
Figure 2
Figure 2. Overview of PresentAgent-2. PresentAgent-2 turns a user query into a presentation video through deep research, slide/script generation, audio synthesis, and video composition. view at source ↗
Figure 3
Figure 3. Evaluation pipeline. Objective quiz eval… view at source ↗
Figure 4
Figure 4. Overview of the PresentAgent-2 framework. Given a user query and a selected presentation… view at source ↗
Figure 5
Figure 5. Qualitative examples of PresentAgent-2 across three presentation settings. Rows from top to… view at source ↗
Figure 6
Figure 6. Additional qualitative examples of PresentAgent-2, Part 1. Each column corresponds to… view at source ↗
Figure 7
Figure 7. Additional qualitative examples of PresentAgent-2, Part 2. Each column corresponds to… view at source ↗
read the original abstract

Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent-2. Website: https://aigeeksgroup.github.io/PresentAgent-2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PresentAgent-2, an agentic framework for generating presentation videos from user queries. It summarizes the query, performs deep research over presentation-friendly sources to collect multimodal resources (text, images, GIFs, videos), constructs slides, generates mode-specific scripts, and composes complete videos. The framework supports three modes in a unified system: Single Presentation (narrated video), Discussion (multi-speaker with structured roles), and Interaction (grounded Q&A). The paper also describes a multimodal presentation benchmark with criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding.

Significance. If the framework's research and composition steps reliably produce coherent, error-free multimodal videos, this would represent a meaningful advance toward generalist agents for query-driven presentation video generation, extending beyond static document-based slides to include dialogue and interaction. The unified handling of three distinct modes and the introduction of a dedicated benchmark with task-specific metrics are notable strengths. The public code release further supports reproducibility and extension.

major comments (3)
  1. [Framework description (research collection and composition steps)] The central claim that PresentAgent-2 enables 'query-driven, research-grounded presentation video generation' rests on the automated research step yielding high-quality, factually accurate multimodal resources that compose without errors. The framework description (research collection → slide construction → script generation → video composition) provides no explicit verification, grounding, or error-correction mechanisms for collected resources, making factual hallucinations or integration failures a direct risk to all downstream outputs and the three supported modes.
  2. [Evaluation and benchmark description] Although the paper states that 'to evaluate these capabilities, we build a multimodal presentation benchmark' with specific criteria, no quantitative results, ablation studies, error analysis, or baseline comparisons are reported. This absence makes it impossible to assess whether the described capabilities are achieved or to validate claims about content quality, media relevance, or interaction grounding.
  3. [Mode descriptions and system architecture] The manuscript claims support for 'three independent presentation modes within a unified framework' but offers no details on how mode-specific scripts are generated, how speaker roles are enforced in Discussion mode, or how Interaction mode grounds answers in slides/scripts/evidence. Without these implementation specifics or results, the independence and integration claims cannot be evaluated.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from one or two concrete examples illustrating the input query, collected resources, and output video for each of the three modes to clarify the distinctions.
  2. [Framework overview] Notation for the overall pipeline (e.g., how the topic summary feeds into research) is not formalized; a simple diagram or pseudocode would improve clarity. One possible notational sketch follows this list.
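For illustration, one way the requested notation could look, with operators introduced here rather than taken from the paper:

```latex
% Hypothetical notation; the operator names are ours, not the authors'.
\begin{align*}
  t   &= \mathrm{Summarize}(q)   && \text{open query $q$ distilled to a topic} \\
  R   &= \mathrm{Research}(t)    && \text{multimodal resources: text, images, GIFs, videos} \\
  S   &= \mathrm{Slides}(t, R)   && \text{slide construction} \\
  c_m &= \mathrm{Script}(S, R, m) && m \in \{\text{single}, \text{discussion}, \text{interaction}\} \\
  V   &= \mathrm{Compose}(S, c_m, \mathrm{TTS}(c_m), R) && \text{final presentation video}
\end{align*}
```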

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and positive remarks on the potential impact of PresentAgent-2 and the benchmark. We address the major comments point by point below, with plans for revisions to enhance clarity and completeness.

read point-by-point responses
  1. Referee: [Framework description (research collection and composition steps)] The central claim that PresentAgent-2 enables 'query-driven, research-grounded presentation video generation' rests on the automated research step yielding high-quality, factually accurate multimodal resources that compose without errors. The framework description (research collection → slide construction → script generation → video composition) provides no explicit verification, grounding, or error-correction mechanisms for collected resources, making factual hallucinations or integration failures a direct risk to all downstream outputs and the three supported modes.

    Authors: We agree that additional details on verification mechanisms would strengthen the paper. The framework employs an agentic approach with built-in grounding through multi-source retrieval and LLM-based consistency checks during resource collection. In the revised manuscript, we will include a new subsection under the framework description that explicitly outlines the verification, grounding, and error-correction steps, such as cross-validation of facts across sources and iterative refinement of collected media. revision: yes
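A minimal sketch of the cross-source validation the rebuttal describes, assuming an LLM-based support judgment; llm_judge_supports, its keyword-overlap stub, and the support threshold are placeholders, not the framework's actual mechanism.

```python
# Hypothetical cross-source fact check; llm_judge_supports() stands in for an
# LLM consistency call and is implemented here as a trivial keyword test.
from typing import List


def llm_judge_supports(claim: str, source_text: str) -> bool:
    # Placeholder for an LLM entailment/consistency judgment.
    return claim.lower() in source_text.lower()


def cross_validate(claim: str, sources: List[str], min_support: int = 2) -> bool:
    """Keep a collected claim only if enough independent sources support it."""
    support = sum(llm_judge_supports(claim, s) for s in sources)
    return support >= min_support


def filter_claims(claims: List[str], sources: List[str]) -> List[str]:
    # Claims failing cross-validation would be dropped or sent back for re-research.
    return [c for c in claims if cross_validate(c, sources)]
```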

  2. Referee: [Evaluation and benchmark description] Although the paper states that 'to evaluate these capabilities, we build a multimodal presentation benchmark' with specific criteria, no quantitative results, ablation studies, error analysis, or baseline comparisons are reported. This absence makes it impossible to assess whether the described capabilities are achieved or to validate claims about content quality, media relevance, or interaction grounding.

    Authors: The manuscript introduces the benchmark and includes qualitative evaluations via detailed case studies demonstrating the capabilities across modes. We acknowledge that quantitative results, ablations, and baseline comparisons are not yet included. We will add these in the revision by reporting human evaluation scores on the criteria, error analysis from the case studies, and comparisons where applicable. This will provide a more rigorous assessment. revision: yes
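If human ratings on the benchmark criteria are added, aggregation could look roughly like the sketch below; only the criterion names come from the abstract, while the 1-5 scale and per-criterion averaging are assumptions.

```python
# Illustrative aggregation of per-criterion human ratings; scale and scheme assumed.
from statistics import mean
from typing import Dict, List

CRITERIA = ["content_quality", "media_relevance", "dynamic_media_use",
            "dialogue_naturalness", "interaction_grounding"]


def aggregate_scores(ratings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average each criterion over the raters who scored it (1-5 scale assumed)."""
    return {c: mean(r[c] for r in ratings if c in r)
            for c in CRITERIA if any(c in r for r in ratings)}


# Example: two raters scoring one generated video.
print(aggregate_scores([
    {"content_quality": 4, "media_relevance": 5, "dialogue_naturalness": 3},
    {"content_quality": 5, "media_relevance": 4, "dialogue_naturalness": 4},
]))
```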

  3. Referee: [Mode descriptions and system architecture] The manuscript claims support for 'three independent presentation modes within a unified framework' but offers no details on how mode-specific scripts are generated, how speaker roles are enforced in Discussion mode, or how Interaction mode grounds answers in slides/scripts/evidence. Without these implementation specifics or results, the independence and integration claims cannot be evaluated.

    Authors: We will revise the architecture and mode description sections to provide the requested implementation details. Specifically, we will describe the prompt templates and role assignment logic for Discussion mode, the script generation process differentiated by mode, and the retrieval mechanism for grounding answers in Interaction mode. Additional figures illustrating the unified pipeline with mode branches will be added. revision: yes
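A rough illustration of the two mechanisms promised here: a role-tagged prompt template for Discussion turns and a toy lexical retriever standing in for whatever grounding retrieval Interaction mode actually uses. The template text and retrieval heuristic are assumptions, not the paper's implementation.

```python
# Hypothetical Discussion-mode prompt template and Interaction-mode retriever.
from typing import List

DISCUSSION_TEMPLATE = (
    "You are the {role}. For slide {slide_no}, {duty}. "
    "Stay consistent with the slide content: {slide_text}"
)


def discussion_turn_prompt(role: str, duty: str, slide_no: int, slide_text: str) -> str:
    # The filled prompt would be sent to a dialogue LLM to produce one speaker turn.
    return DISCUSSION_TEMPLATE.format(role=role, duty=duty,
                                      slide_no=slide_no, slide_text=slide_text)


def retrieve_grounding(question: str, evidence: List[str], k: int = 3) -> List[str]:
    # Toy lexical-overlap ranking over slides, scripts, and retrieved evidence;
    # a real system would more likely use embedding search.
    q_tokens = set(question.lower().split())
    ranked = sorted(evidence,
                    key=lambda e: len(q_tokens & set(e.lower().split())),
                    reverse=True)
    return ranked[:k]
```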

Circularity Check

0 steps flagged

No circularity: systems description with no derivations or fitted predictions

full rationale

The paper describes an agentic pipeline for query-driven presentation video generation (research collection → slide construction → script generation → multimodal composition) across three modes, supported by a new benchmark with task-specific criteria. No equations, parameter fits, predictions, or first-principles derivations appear in the provided text or abstract; claims rest on the implemented framework and external evaluation rather than any self-referential reduction. Self-citations are absent from the load-bearing steps, and the weakest assumption (research quality) is an engineering limitation, not a circularity in derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The system rests on standard assumptions about LLM capabilities for summarization and research rather than new parameters or entities.

axioms (1)
  • domain assumption: Large language models and multimodal models can reliably summarize queries, retrieve relevant media, and generate coherent scripts and dialogue.
    Invoked implicitly in the description of the research, slide construction, and script generation steps.

pith-pipeline@v0.9.0 · 5593 in / 1146 out tokens · 46348 ms · 2026-05-13T02:25:32.478083+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 7 internal anchors

  1. [1]

    Paper2poster: Towards multimodal poster automation from scientific papers.arXiv preprint arXiv:2505.21497, 2025

    Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, and Philip Torr. Paper2poster: Towards multimodal poster automation from scientific papers.arXiv preprint arXiv:2505.21497, 2025

  2. [2]

    Presentagent: Multimodal agent for presentation video generation

    Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, and Yang Zhao. Presentagent: Multimodal agent for presentation video generation. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 760–773, 2025

  3. [3]

    Paper2video: Automatic video generation from scientific papers.arXiv preprint arXiv:2510.05096, 2025

    Zeyu Zhu, Kevin Qinghong Lin, and Mike Zheng Shou. Paper2video: Automatic video generation from scientific papers.arXiv preprint arXiv:2510.05096, 2025

  4. [4]

    VideoAgent: Personalized Synthesis of Scientific Videos

    Xiao Liang, Bangxin Li, Zixuan Chen, Hanyue Zheng, Zhi Ma, Di Wang, Cong Tian, and Quan Wang. Videoagent: Personalized synthesis of scientific videos.arXiv preprint arXiv:2509.11253, 2025

  5. [5]

    Talk to Your Slides: High-Efficiency Slide Editing via Language-Driven Structured Data Manipulation

    Kyudan Jung, Hojun Cho, Jooyeol Yun, Soyoung Yang, Jaehyeok Jang, and Jaegul Choo. Talk to your slides: Language-driven agents for efficient slide editing.arXiv preprint arXiv:2505.11604, 2025

  6. [6]

    Pptagent: Generating and evaluating presentations beyond text-to-slides

    Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. Pptagent: Generating and evaluating presentations beyond text-to-slides. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14413–14429, 2025

  7. [7]

    Auto-slides: An interactive multi-agent system for creating and customizing research presentations.arXiv preprint arXiv:2509.11062, 2025

    Yuheng Yang, Wenjia Jiang, Yang Wang, Yi Song, Yiwei Wang, and Chi Zhang. Auto-slides: An interactive multi-agent system for creating and customizing research presentations.arXiv preprint arXiv:2509.11062, 2025

  8. [8]

    Node-based editing for multimodal generation of text, audio, image, and video.arXiv preprint arXiv:2511.03227, 2025

    Alexander Htet Kyaw and Lenin Ravindranath Sivalingam. Node-based editing for multimodal generation of text, audio, image, and video.arXiv preprint arXiv:2511.03227, 2025

  9. [9]

    Polyvivid: Vivid multi-subject video generation with cross-modal interaction and enhancement

    Teng Hu, Zhentao Yu, Zhengguang Zhou, Jiangning Zhang, Yuan Zhou, Qinglin Lu, and Ran Yi. Polyvivid: Vivid multi-subject video generation with cross-modal interaction and enhancement. arXiv preprint arXiv:2506.07848, 2025

  10. [10]

    Let them talk: Audio-driven multi-person conversational video generation.arXiv preprint arXiv:2505.22647, 2025

    Zhe Kong, Feng Gao, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Xunliang Cai, Guanying Chen, and Wenhan Luo. Let them talk: Audio-driven multi-person conversational video generation.arXiv preprint arXiv:2505.22647, 2025

  11. [11]

    Autopresent: Designing structured visuals from scratch

    Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, et al. Autopresent: Designing structured visuals from scratch. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2902–2911, 2025

  12. [12]

    Infinity parser: Layout aware reinforcement learning for scanned document parsing.arXiv preprint arXiv:2506.03197, 2025

    Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, et al. Infinity parser: Layout aware reinforcement learning for scanned document parsing.arXiv preprint arXiv:2506.03197, 2025

  13. [13]

    Doc2ppt: Automatic presentation slides generation from scientific documents

    Tsu-Jui Fu, William Yang Wang, Daniel McDuff, and Yale Song. Doc2ppt: Automatic presentation slides generation from scientific documents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 634–642, 2022

  14. [14]

    Slides agent: An intelligent agent for creating and analyzing presentations using large

    Aleksandr Konstantinov, Anna Avdyushina, and Tatiana Markina. Slides agent: An intelligent agent for creating and analyzing presentations using large. In Creativity in Intelligent Technologies and Data Science: 6th International Conference, CIT&DS 2025, Volgograd, Russia, September 22–25, 2025, Proceedings, page 123. Springer Nature, 2026

  15. [15]

    Slidegen: Collaborative multimodal agents for scientific slide generation.arXiv preprint arXiv:2512.04529, 2025

    Xin Liang, Xiang Zhang, Yiwei Xu, Siqi Sun, and Chenyu You. Slidegen: Collaborative multimodal agents for scientific slide generation.arXiv preprint arXiv:2512.04529, 2025

  16. [16]

    Presenting a paper is an art: Self-improvement aesthetic agents for academic presentations.arXiv preprint arXiv:2510.05571, 2025

    Chengzhi Liu, Yuzhe Yang, Kaiwen Zhou, Zhen Zhang, Yue Fan, Yanan Xie, Peng Qi, and Xin Eric Wang. Presenting a paper is an art: Self-improvement aesthetic agents for academic presentations.arXiv preprint arXiv:2510.05571, 2025

  17. [17]

    Gpt4tools: Teaching large language model to use tools via self-instruction.Advances in Neural Information Processing Systems, 36:71995–72007, 2023

    Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction.Advances in Neural Information Processing Systems, 36:71995–72007, 2023

  18. [18]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023

  19. [19]

    Os-genesis: Automating gui agent trajectory construction via reverse task synthesis

    Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. Os-genesis: Automating gui agent trajectory construction via reverse task synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5555–5579, 2025

  20. [20]

    Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398, 2023

    Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong Wang. Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation.arXiv preprint arXiv:2309.00398, 2023

  21. [21]

    Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation

    Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao. Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18826–18836, 2025

  22. [22]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  23. [23]

    Unified multimodal understanding and generation models: Advances, challenges, and opportunities

    Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities.arXiv preprint arXiv:2505.02567, 2025

  24. [24]

    Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026

  25. [25]

    Motion anything: Any to motion generation.arXiv preprint arXiv:2503.06955, 2025

    Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, and Richard Hartley. Motion anything: Any to motion generation.arXiv preprint arXiv:2503.06955, 2025

  26. [26]

    Infinimotion: Mamba boosts memory in transformer for arbitrary long motion generation.arXiv preprint arXiv:2407.10061, 2024

    Zeyu Zhang, Akide Liu, Qi Chen, Feng Chen, Ian Reid, Richard Hartley, Bohan Zhuang, and Hao Tang. Infinimotion: Mamba boosts memory in transformer for arbitrary long motion generation.arXiv preprint arXiv:2407.10061, 2024

  27. [27]

    Kmm: Key frame mask mamba for extended motion generation.arXiv preprint arXiv:2411.06481, 2024

    Zeyu Zhang, Hang Gao, Akide Liu, Qi Chen, Feng Chen, Yiran Wang, Danning Li, Rui Zhao, Zhenming Li, Zhongwen Zhou, et al. Kmm: Key frame mask mamba for extended motion generation.arXiv preprint arXiv:2411.06481, 2024

  28. [28]

    Motion mamba: Efficient and long sequence motion generation

    Zeyu Zhang, Akide Liu, Ian Reid, Richard Hartley, Bohan Zhuang, and Hao Tang. Motion mamba: Efficient and long sequence motion generation. InEuropean Conference on Computer Vision, pages 265–282. Springer, 2024

  29. [29]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023

  30. [30]

    Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.Nature Communications, 16(1):7866, 2025

    Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Hui Hui, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.Nature Communications, 16(1):7866, 2025

  31. [31]

    Mavis: A multi-agent framework for long-sequence video storytelling

    Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, and Ning Yu. Mavis: A multi-agent framework for long-sequence video storytelling. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2273–2295, 2026

  32. [32]

    Multimodal content alignment with llm for visual presentation of papers

    Huiying Hu, Zhicheng He, Yixiao Zhou, Tongwei Zhang, and Xiaoqing Lyu. Multimodal content alignment with llm for visual presentation of papers. InInternational Conference on Document Analysis and Recognition, pages 238–256. Springer, 2025

  33. [33]

    Pregenie: An agentic framework for high-quality visual presentation generation. arXiv preprint arXiv:2505.21660, 2025

    Xiaojie Xu, Xinli Xu, Sirui Chen, Haoyu Chen, Fan Zhang, and Ying-Cong Chen. Pregenie: An agentic framework for high-quality visual presentation generation. arXiv preprint arXiv:2505.21660, 2025

  34. [34]

    Presentcoach: Dual-agent presentation coaching through exemplars and interactive feedback. arXiv preprint arXiv:2511.15253, 2025

    Sirui Chen, Jinsong Zhou, Xinli Xu, Xiaoyu Yang, Litao Guo, and Ying-Cong Chen. Presentcoach: Dual-agent presentation coaching through exemplars and interactive feedback. arXiv preprint arXiv:2511.15253, 2025

  35. [35]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  36. [36]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  37. [37]

    Showui: One vision-language-action model for gui visual agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025

  38. [38]

    Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning. arXiv preprint arXiv:2309.15091, 2023

    Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091, 2023

  39. [39]

    Videostudio: Generating consistent-content and multi-scene videos

    Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. Videostudio: Generating consistent-content and multi-scene videos. InEuropean Conference on Computer Vision, pages 468–485. Springer, 2024

  40. [40]

    LLM-grounded video diffusion models

    Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, and Boyi Li. Llm-grounded video diffusion models. arXiv preprint arXiv:2309.17444, 2023