X+Slides: Benchmarking Audience-Conditioned Slide Generation

Anya Jia; Bo Wang; Fan Wu; Haodong Chen; Jiawei Hong; Wei Zhou; Xinyue Shao; Xuanhe Zhou; Yanbing Zhu

arxiv: 2606.19256 · v1 · pith:D3VEP37Unew · submitted 2026-06-17 · 💻 cs.AI

Pith reviewed 2026-06-26 20:36 UTC · model grok-4.3

classification 💻 cs.AI

keywords slide generation · audience conditioning · LLM benchmarks · source-grounded evaluation · presentation generation · utility metrics · audience coverage

The pith

A new benchmark shows current slide generators recover only a partial share of the information different audiences need from the same source.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents X+Slides as a benchmark for slide generation that factors in the target audience rather than treating all viewers the same. It builds an evaluation set of 8,133 source-grounded probes drawn from 113 topics and seven scenes, then applies different utility weights to those probes depending on whether the audience consists of specialists or decision-makers. Experiments on three existing systems find that even the strongest performer covers at most 71 percent of the weighted audience-essential content at a moderate threshold, while also revealing differences in how well claims stay grounded in the source material. The results indicate that visual quality and broad topic coverage alone do not demonstrate that slides deliver what a given audience actually requires.

Core claim

Audience-conditioned slide generation must be measured with source-grounded probes that carry audience-specific utility weights, because existing systems recover only part of the information each audience type needs: at a threshold of τ_A = 0.7, the best observed Audience Coverage is 0.714 for DeepPresenter, 0.594 for SlideTailor, and 0.853 for a NotebookLM ablation that nonetheless shows clear grounding differences.

What carries the argument

Dynamic evaluation framework of 8,133 deduplicated source-grounded probes with audience-specific utility weights, used to compute Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness.
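
A minimal sketch of how a weighted Audience Coverage score could be computed, under stated assumptions. The paper's exact formula and the precise role of τ_A are not given in the material reviewed here, so this sketch assumes that a probe counts as audience-essential when its utility weight is at least τ_A, and that coverage is the weight-normalized share of essential probes a deck answers. The `Probe` class, the `audience_coverage` function, the audience labels, and all weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    probe_id: str
    # Audience name -> utility weight. A [0, 1] scale is an assumption.
    weights: dict


def audience_coverage(probes, answered_ids, audience, tau=0.7):
    """Weighted share of audience-essential probes that a deck answers.

    Assumption: a probe is essential for `audience` when its weight >= tau.
    """
    total = 0.0
    hit = 0.0
    for probe in probes:
        weight = probe.weights.get(audience, 0.0)
        if weight < tau:
            continue  # not essential for this audience under the assumption
        total += weight
        if probe.probe_id in answered_ids:
            hit += weight
    return hit / total if total > 0 else 0.0


# Illustrative placeholder data, not taken from the benchmark.
probes = [
    Probe("p1", {"specialist": 0.9, "decision_maker": 0.2}),
    Probe("p2", {"specialist": 0.8, "decision_maker": 0.9}),
    Probe("p3", {"specialist": 0.3, "decision_maker": 1.0}),
]
answered = {"p1", "p2"}  # probes the deck can answer

print(audience_coverage(probes, answered, "specialist"))                # 1.0
print(round(audience_coverage(probes, answered, "decision_maker"), 3))  # 0.474
```

The same deck and the same probes produce different scores for the two audiences, which is the behavior the weighting scheme is meant to expose.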

If this is right

Audience Coverage quantifies the fraction of essential information conveyed to a specific listener group.
Systems must be assessed on source grounding rather than visual quality or topic breadth alone.
NotebookLM-style ablations can reach higher coverage numbers but still differ in how claims trace back to the source.
Four complementary metrics together give a fuller picture than completeness checks used in prior benchmarks.
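
To make the other three metrics concrete, here is a hedged sketch of one plausible reading of each. The abstract describes them only in words, so the formulas below are assumptions: Domain-wise Coverage as weighted coverage per information type, Efficiency as delivered utility divided by an attention cost (here an arbitrary positive number such as estimated minutes), and Correctness as the share of slide claims judged to be supported by the source. All function names and data are hypothetical.

```python
from collections import defaultdict


def domain_wise_coverage(probes, answered_ids):
    """probes: iterable of (probe_id, domain, weight). Returns domain -> coverage."""
    total = defaultdict(float)
    hit = defaultdict(float)
    for probe_id, domain, weight in probes:
        total[domain] += weight
        if probe_id in answered_ids:
            hit[domain] += weight
    return {d: hit[d] / total[d] for d in total if total[d] > 0}


def efficiency(probes, answered_ids, attention_cost):
    """Delivered utility per unit of attention cost (cost unit is an assumption)."""
    if attention_cost <= 0:
        raise ValueError("attention_cost must be positive")
    delivered = sum(w for pid, _, w in probes if pid in answered_ids)
    return delivered / attention_cost


def correctness(claim_supported):
    """claim_supported: list of booleans, one per slide claim."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Illustrative placeholder data, not taken from the benchmark.
typed_probes = [
    ("p1", "method", 1.0),
    ("p2", "method", 1.0),
    ("p3", "result", 2.0),
]
deck_answers = {"p1", "p3"}

print(domain_wise_coverage(typed_probes, deck_answers))  # {'method': 0.5, 'result': 1.0}
print(efficiency(typed_probes, deck_answers, attention_cost=10))
print(correctness([True, True, False, True]))  # 0.75
```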

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future generators could incorporate audience type as an explicit input when selecting which source details to emphasize.
The probe-weighting method could be adapted to evaluate other audience-sensitive outputs such as reports or summaries.
Extending the benchmark to live presentation feedback might reveal whether coverage scores predict real listener comprehension.

Load-bearing premise

Audience-specific utility weights assigned to the same source-grounded probes accurately capture real-world differences in what different audiences need from the same source material.

What would settle it

A direct comparison in which target audiences rate the practical utility of generated slides and the ratings fail to correlate with the benchmark's weighted Audience Coverage scores.
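
Such a comparison could be run as a rank correlation between benchmark scores and human ratings. The sketch below implements Spearman's rank correlation in plain Python, with average ranks for ties. The choice of Spearman is an assumption rather than anything the paper specifies, and the scores and ratings are made-up placeholders.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values: one coverage score and one mean audience rating per deck.
coverage_scores = [0.2, 0.5, 0.6, 0.9]
audience_ratings = [1, 3, 2, 5]
print(round(spearman(coverage_scores, audience_ratings), 3))  # 0.8
```

A correlation near zero on real audience ratings would be the failure signal described above; a strong positive one would support the weights.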

Figures

Figures reproduced from arXiv: 2606.19256 by Anya Jia, Bo Wang, Fan Wu, Haodong Chen, Jiawei Hong, Wei Zhou, Xinyue Shao, Xuanhe Zhou, Yanbing Zhu.

**Figure 1.** Workflow of X+Slides for slides evaluation. Given source documents, audience profiles, usage scenarios, and generated slide decks, X+Slides applies specific probe weights, verifies source-supported answerability, and reports audience-conditioned evaluation metrics.

**Figure 2.** Benchmark construction pipeline. X+Slides collects and parses diverse sources, builds deduplicated evidence-backed probes, assigns metadata, and attaches audience utility weights.

**Figure 3.** Source-topic composition of X+Slides, split by academic fields, non-academic categories, and audience-conditioned probe scale.

**Figure 4.** Deck showcase for The 5 Principles of Growth in B2B Marketing. This full-profile example …

**Figure 5.** Deck showcase for Airbnb 2024 Annual Report. This annual-report example shows …

**Figure 6.** Deck showcase for Berkshire Hathaway 2025 Annual Report. This agnostic decision …

**Figure 7.** Deck showcase for CDC Clear Communication Index User Guide. This communication …

**Figure 8.** Deck showcase for Climate Change 2023: AR6 Synthesis Report. This climate-policy …

**Figure 9.** Deck showcase for OECD Regulatory Policy Outlook 2021. This governance-report …

**Figure 10.** Deck showcase for G20/OECD Principles of Corporate Governance 2023. This corporate …

Original abstract

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

X+Slides adds a useful benchmark for audience-aware slide generation but the audience utility weights rest on unvalidated assignments that undercut the main claims.

The paper introduces X+Slides, a benchmark built on 8,133 source-grounded probes across 113 topics, with four metrics that include audience coverage after weighting the same probes differently for specialists versus decision-makers. It tests three systems and reports concrete numbers like 0.714, 0.594, and 0.853 at τ_A=0.7. That setup is new and directly targets a real gap: most slide benchmarks ignore who the slides are for.

What works is the source-grounded probe construction and the decision to measure efficiency and correctness alongside coverage. The results on NotebookLM versus the other two systems highlight that broad topic coverage alone does not guarantee audience fit, which is a fair point.

The soft spot is the audience-specific utility weights. The abstract says they are assigned to the probes, yet gives no procedure, no inter-rater numbers, and no check against actual user data. If those weights come from author judgment rather than measured audience preferences, the coverage gaps become hard to interpret as evidence about the systems. The comparative claims rest on that step.
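
For reference, the missing inter-rater numbers could take the form of a chance-corrected agreement statistic. The sketch below computes Cohen's kappa for two raters who each label every probe with a utility level. Whether the authors used raters at all, and which statistic they would report, is unknown; the labels and the `cohens_kappa` helper are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Placeholder labels for six probes, judged for one audience type.
rater_1 = ["high", "high", "low", "low", "high", "low"]
rater_2 = ["high", "low", "low", "low", "high", "low"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.667
```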

This is the kind of paper that belongs in a reading group focused on applied LLM evaluation. It is worth referee time because the benchmark idea is practical and the evaluation framework is reproducible in principle, but any review should press hard on how the weights were derived and whether they hold up outside the authors' choices. I would not cite it until that part is tightened.

Referee Report

2 major / 2 minor

Summary. The paper introduces X+Slides, a benchmark for audience-conditioned slide generation from source documents. It constructs a corpus over 113 topics and seven scenes, defines 8,133 deduplicated source-grounded probes, assigns audience-specific utility weights to those probes, and reports four metrics (Audience Coverage, Domain-wise Coverage, Efficiency, Correctness). Experiments on DeepPresenter, SlideTailor, and a NotebookLM ablation report Audience Coverage values of 0.714, 0.594, and 0.853 at τ_A=0.7 and conclude that existing systems recover only a partial fraction of audience-essential information, so visual quality and broad topic coverage are insufficient without source-grounded, audience-weighted evaluation.

Significance. If the utility-weight assignment procedure and probe validation are made reproducible and externally validated, the benchmark would supply a concrete, audience-differentiated evaluation framework that existing slide-generation papers largely lack. The reported coverage gaps would then constitute a falsifiable, quantitative demonstration that generic completeness metrics are inadequate.

major comments (2)

[Abstract] The Audience Coverage metric is defined as the sum of audience-specific utility weights on recovered probes, yet the abstract (and, from the provided description, the methods) gives no account of how those weights were assigned, what inter-rater protocol was used, or any external validation against real audience responses (specialists vs. decision-makers). Because the comparative claims rest directly on the numerical differences produced by these weights, the omission renders the reported scores (0.714, 0.594, 0.853) uninterpretable as evidence of incomplete audience conditioning.
[Benchmark construction] (implied in abstract) The paper states that 8,133 probes are “deduplicated, source-grounded,” but supplies no description of probe generation, deduplication criteria, or any validation that the probes actually capture audience-essential information rather than author-chosen content. This construction step is load-bearing for all four metrics.

minor comments (2)

[Abstract] The symbol τ_A is used without definition in the abstract; a brief parenthetical or footnote would clarify the threshold.
[Abstract] The phrase “treated as evidence support” appears to be a minor phrasing error; “treated as evidence of support” or “treated as supporting evidence” would be clearer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important gaps in the description of our benchmark construction and metric definitions. We address each major comment below and will incorporate the requested details into a revised manuscript.

Referee: [Abstract] The Audience Coverage metric is defined as the sum of audience-specific utility weights on recovered probes, yet the abstract (and, from the provided description, the methods) gives no account of how those weights were assigned, what inter-rater protocol was used, or any external validation against real audience responses (specialists vs. decision-makers). Because the comparative claims rest directly on the numerical differences produced by these weights, the omission renders the reported scores (0.714, 0.594, 0.853) uninterpretable as evidence of incomplete audience conditioning.

Authors: We agree that the abstract and methods description omit the necessary details on utility weight assignment, inter-rater protocol, and external validation, which limits interpretability of the Audience Coverage scores. The full manuscript provides only a high-level statement that weights are audience-specific. In revision we will add a dedicated subsection describing the weight assignment process, including annotation guidelines, number of raters, inter-rater agreement, and any validation steps performed against real audience distinctions (specialists versus decision-makers).
revision: yes

Referee: [Benchmark construction] (implied in abstract) The paper states that 8,133 probes are “deduplicated, source-grounded,” but supplies no description of probe generation, deduplication criteria, or any validation that the probes actually capture audience-essential information rather than author-chosen content. This construction step is load-bearing for all four metrics.

Authors: We agree that the manuscript does not supply the required details on probe generation, deduplication criteria, or validation that the probes capture audience-essential information. The current text only asserts that the probes are deduplicated and source-grounded. We will revise the benchmark construction section to include a complete description of the probe generation pipeline, the deduplication criteria and thresholds employed, and the validation procedures (such as expert review or pilot studies) used to confirm that probes reflect audience-essential content rather than arbitrary selections.
revision: yes
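
To illustrate what a stated deduplication criterion and threshold might look like, here is a minimal sketch that drops a probe when its token-set Jaccard similarity to an already kept probe reaches a threshold. This is a stand-in technique chosen for illustration, not the paper's method, which is undescribed; the 0.8 threshold, the tokenizer, and the example questions are all assumptions.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(probes, threshold=0.8):
    """Keep a probe only if it is below `threshold` similarity to every kept probe."""
    kept = []
    kept_tokens = []
    for probe in probes:
        current = tokens(probe)
        if all(jaccard(current, seen) < threshold for seen in kept_tokens):
            kept.append(probe)
            kept_tokens.append(current)
    return kept


# Placeholder probe questions, not taken from the benchmark.
questions = [
    "What was total revenue in 2024?",
    "What was the total revenue in 2024?",
    "Who chairs the audit committee?",
]
print(deduplicate(questions))
# ['What was total revenue in 2024?', 'Who chairs the audit committee?']
```

Reporting the similarity measure and the threshold in this form would let readers reproduce the probe count.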

Circularity Check

0 steps flagged

No circularity: benchmark metrics defined independently from external probes

The paper introduces X+Slides as an evaluation framework built on 8,133 source-grounded probes across topics and scenes. Metrics (Audience Coverage, Domain-wise Coverage, Efficiency, Correctness) are computed by applying assigned audience-specific utility weights to these probes. No equations, derivations, or predictions are present that reduce to fitted parameters or self-referential inputs. The weights are stated as assigned inputs to the framework rather than outputs derived from the evaluated systems. No self-citations justify core premises, uniqueness theorems, or ansatzes. The experiments apply the benchmark to external systems (DeepPresenter, SlideTailor, NotebookLM) without the results feeding back into the metric definitions. The chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that the chosen probes and weights are representative.

axioms (1)

Domain assumption: Audience-specific utility weights accurately reflect real differences in audience information needs.
The four metrics depend on these weights being meaningful.

Reference graph

Works this paper leans on

17 extracted references · 3 linked inside Pith

[1] J. Ge, Z. Z. Wang, X. Zhou, Y.-H. Peng, S. Subramanian, Q. Tan, M. Sap, A. Suhr, D. Fried, G. Neubig et al., “Autopresent: Designing structured visuals from scratch,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2902–2911.

[2] W. Tang, J. Xiao, W. Jiang, X. Xiao, Y. Wang, X. Tang, Q. Li, Y. Ma, J. Liu, S. Tang et al., “Slidecoder: Layout-aware rag-enhanced hierarchical slide generation from design,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 9026–9050.

[3] H. Zheng, X. Guan, H. Kong, W. Zhang, J. Zheng, W. Zhou, H. Lin, Y. Lu, X. Han, and L. Sun, “Pptagent: Generating and evaluating presentations beyond text-to-slides,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 14 413–14 429.

[4] H. Zheng, G. Mo, X. Yan, Q. Yuan, W. Zhang, X. Chen, Y. Lu, H. Lin, X. Han, and L. Sun, “Deeppresenter: Environment-grounded reflection for agentic presentation generation,” arXiv preprint arXiv:2602.22839, 2026. [Pith/arXiv]

[5] E. Xie, D. Waterfield, M. Kennedy, and A. Zhang, “Slidebot: A multi-agent framework for generating informative, reliable, multi-modal presentations,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 48, 2026, pp. 40 907–40 915.

[6] D. Jang, M. L. Heisler, L. Xing, Y. Li, E. Wang, Y. Xiong, Y. Zhang, and Z. Fan, “Deckbench: Benchmarking multi-agent frameworks for academic slide generation and editing,” arXiv preprint arXiv:2602.13318, 2026. [Pith/arXiv]

[7] W. Zeng, M. Ouyang, L. Cui, and H. T. Ng, “Slidetailor: Personalized presentation slide generation for scientific papers,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 41, 2026, pp. 34 584–34 592.

[8] T. C. Ozden, S. VS, F. Horoz, O. Kara, J. Kim, and J. M. Rehg, “Narrative-driven paper-to-slide generation via arcdeck,” arXiv preprint arXiv:2604.11969, 2026. [Pith/arXiv]

[9] Y. Yang, W. Jiang, Y. Wang, Y. Song, Y. Wang, and C. Zhang, “Auto-slides: An interactive multi-agent system for creating and customizing research presentations,” arXiv preprint arXiv:2509.11062, 2025.

[10] X.-S. Chen, J. Zhu, P.-l. Li, H. Wang, S. Yang, and M.-H. Guo, “Presentbench: A fine-grained rubric-based benchmark for slide generation,” arXiv preprint arXiv:2603.07244, 2026.

[11] Y. Yang, W. Li, H. Ren, Z. Lu, K. Wang, Z. Huang, Z. Zong, M. Zhan, and H. Li, “Slidesgenbench: Evaluating slides generation via computational and quantitative metrics,” arXiv preprint arXiv:2601.09487, 2026.

[12] M. Ofengenden, Y. Man, Z. Pang, and Y.-X. Wang, “Pptarena: A benchmark for agentic powerpoint editing,” arXiv preprint arXiv:2512.03042, 2025.

[13] Z. Huang, X. Liu, T. Hu, K. Zhang, and Y. Liu, “Pptbench: Towards holistic evaluation of large language models for powerpoint layout and design understanding,” arXiv preprint arXiv:2512.02624, 2025.

[14] W. Pang, K. Q. Lin, X. Jian, X. He, and P. Torr, “Paper2poster: Towards multimodal poster automation from scientific papers,” arXiv preprint arXiv:2505.21497, 2025.

[15] Z. Zhao, C. Vania, S. Kayal, N. Khan, S. B. Cohen, and E. Yilmaz, “Personalens: A benchmark for personalization evaluation in conversational ai assistants,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 18 023–18 055.

[16] T. Zhang, C. Zhu, Y. Shen, W. Luo, Y. Zhang, H. Liang, F. Yang, M. Lin, Y. Qiao, W. Chen et al., “Cfbench: A comprehensive constraints-following benchmark for llms,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 32 926–32 944.

[17] S. Pu, Y. Wang, D. Chen, Y. Chen, G. Wang, Q. Qin, Z. Zhang, Z. Zhang, Z. Zhou, S. Gong et al., “Judge anything: Mllm as a judge across any modality,” in Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2025, pp. 5742–5753.
