Pith · machine review for the scientific record

arXiv: 2605.13532 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.CL · cs.CY · cs.HC


AI-Generated Slides: Are They Good? Can Students Tell?


Pith reviewed 2026-05-14 18:53 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CY · cs.HC
keywords generative AI · AI-generated slides · educational technology · student perception · LLMs in education · slide creation

The pith

Coding-assistant tools create slides from course notes that students rate as equal to instructor-made ones and fail to identify as AI-generated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests generative AI tools for turning instructor notes into lecture slides. Educators reviewed output from NotebookLM, Claude, M365 Copilot, Cursor, and Claude Code for accuracy, completeness, and teaching value. Coding assistants produced the strongest slides, which were lightly adjusted and then shown to students in a real course alongside human-created slides. Students gave the AI versions similar quality ratings and could not pick out which slides were machine-generated at rates above chance. A negative correlation appeared between high quality scores and suspicion that the slides were AI-made.

Core claim

Generative AI tools, especially coding assistants, produce slides from course notes that are accurate, complete, and pedagogically sound. In a live classroom test, students rate these slides as comparable in quality to instructor-created slides and cannot reliably identify their AI origin.

What carries the argument

Side-by-side educator narrative assessment of slides from five GenAI tools, followed by student quality ratings and origin-identification surveys in an actual course.

Load-bearing premise

The light modifications made to the best AI slides before classroom use did not systematically favor the AI versions in the student comparison.

What would settle it

A blind test using completely unmodified AI slides where students identify the AI origin at rates significantly above 50 percent.

Original abstract

As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor-authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI generated slides are "good" via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI generated slides. We find that coding assistant tools produce slides that were most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides to be of similar quality as instructor-created slides, and cannot reliably identify which slides are AI-generated. Additionally, we find a negative correlation between a high quality rating and a high "AI-generated" rating, suggesting students associate poor quality with the source of the slides being AI. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. This paper evaluates generative AI tools (NotebookLM, Claude, M365 Copilot, Cursor, Claude Code) for creating educational slides from instructor notes. Educator narrative assessments identify coding assistants as producing the most accurate, complete, and pedagogically sound outputs. The top slides undergo light modifications before deployment in a live course, where students rate quality and attempt to identify AI vs. instructor slides. Results show students rate GenAI slides equivalently to human-created ones, cannot reliably distinguish sources, and exhibit a negative correlation between high quality ratings and high AI-attribution ratings.

Significance. If the findings hold, this provides direct empirical support for integrating coding-assistant GenAI into instructional workflows, with ecological validity from the live-course component. The student indistinguishability result and quality-AI correlation offer actionable insights for educators on perception biases, potentially accelerating responsible adoption of AI in pedagogy while highlighting needs for further validation studies.

major comments (3)
  1. [§4.3] §4.3 (Student Deployment): The description states that the best AI slides were used 'with some modification' before classroom deployment, but provides no quantification of change volume, type (e.g., factual corrections, flow edits), or rationale. This is load-bearing for the equivalence and identification claims, as unmeasured edits could have systematically addressed raw GenAI weaknesses, meaning results apply only to post-edit versions rather than unmodified outputs.
  2. [§5.2] §5.2 (Identification Task): The claim that students 'cannot reliably identify' AI-generated slides lacks specification of the statistical test (e.g., proportion test against chance, chi-square), sample size per condition, and effect size or power analysis. Without these, it is impossible to distinguish true indistinguishability from low statistical power, undermining the central student-perception result.
  3. [§3.2] §3.2 (Educator Assessment): The narrative selection of coding assistants as superior relies on educator judgments of 'pedagogically sound' without reported inter-rater reliability, explicit scoring rubric, or example slide excerpts illustrating differences. This reduces transparency for the tool-ranking claim that drives the subsequent student study.
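Major comment 2 asks what statistical machinery would back the "cannot reliably identify" claim. One standard choice is an exact binomial test of the identification rate against the 50% chance rate. The sketch below uses only the standard library; the counts (100 responses, 52 correct) are hypothetical, since the paper does not report these figures.

```python
from math import comb

def binom_p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Exact one-sided p-value: P(X >= k) for X ~ Binomial(n, p).

    Tests whether an observed identification rate exceeds chance.
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical counts -- illustrative only, not taken from the paper.
n_responses = 100
correct = 52  # correct AI-vs-human identifications
p_value = binom_p_at_least(correct, n_responses)
print(f"P(X >= {correct} | chance) = {p_value:.3f}")  # well above 0.05
```

A non-significant p-value alone cannot distinguish genuine indistinguishability from low power, which is why the comment also asks for a power analysis and per-condition sample sizes.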
minor comments (3)
  1. [Abstract] Abstract: Report the exact number of slides generated per tool, the specific course topic, and total student sample size to support replicability and generalizability claims.
  2. [§5.1] §5.1 (Quality Ratings): The negative correlation between quality and AI-generated ratings should include the Pearson/Spearman coefficient, p-value, and confidence interval rather than a qualitative description only.
  3. [Results] Figure 2 or equivalent: Ensure axis labels and legends clearly distinguish the five tools and human baseline for the educator assessment results.
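Minor comment 2 asks for the correlation coefficient and p-value behind the quality-vs-AI-attribution finding. Since both variables are Likert ratings (ordinal, with ties), Spearman's rho with a permutation p-value is a natural fit. The sketch below is a minimal pure-Python version; the rating vectors are invented for illustration and are not the paper's data.

```python
import random

def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def permutation_p(x, y, iters=2000, seed=0):
    """Two-sided permutation p-value under the null of no association."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(iters):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (iters + 1)

# Hypothetical Likert data: per-slide quality rating vs. "AI-generated" rating.
quality = [5, 4, 5, 3, 4, 2, 5, 3, 4, 5, 2, 4]
ai_vote = [1, 2, 1, 4, 2, 5, 2, 3, 2, 1, 4, 3]
print(f"rho = {spearman(quality, ai_vote):.2f}, "
      f"p = {permutation_p(quality, ai_vote):.3f}")
```

A bootstrap over slides would give the confidence interval the comment requests; the permutation test is used here because it needs no distributional assumptions on ordinal data.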

Circularity Check

0 steps flagged

No circularity: empirical study rests on independent human judgments

full rationale

The paper conducts an empirical evaluation: GenAI tools generate slides from course notes, educators perform narrative assessments to select the best outputs, light modifications are applied, and the resulting slides are deployed in a real course for student perception and identification surveys. No equations, fitted parameters, predictions, or derivations appear anywhere in the workflow. Claims about accuracy, completeness, pedagogical soundness, quality ratings, and identification rates are grounded directly in the collected human data rather than any self-referential reduction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. The central results therefore remain independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard assumptions that educator narrative judgments and student self-reports are valid proxies for slide quality and that the chosen course is representative; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Educator narrative assessments and student Likert ratings accurately reflect pedagogical soundness and perceived quality.
    Invoked when interpreting the tool rankings and student survey results as evidence of slide quality.
  • domain assumption The selected course and student cohort are representative of typical higher-education settings.
    Required to generalize the perception findings beyond the single deployment.

pith-pipeline@v0.9.0 · 5526 in / 1317 out tokens · 34222 ms · 2026-05-14T18:53:30.435428+00:00 · methodology

