Teaching Prompt-Based Programming with LLMs: A 45-Minute Lesson with Guided Practice for End-User Programmers
Pith reviewed 2026-06-30 03:23 UTC · model grok-4.3
The pith
A 45-minute lesson on prompt-based programming produces modest gains in students' ability to specify computational goals to LLMs and larger gains in self-efficacy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our results suggest it is likely that a brief intervention can improve learners' ability to specify computational goals to LLMs. However, the effect was modest, suggesting that prompting skills may require more time and practice to develop. We provide a lightweight lesson that requires no prior CS background and can be readily dropped into existing courses.
What carries the argument
The 45-minute prompt-based programming lesson with guided practice, evaluated through randomized comparison against a code-tracing control on pre-post changes in goal-specification performance and self-efficacy.
Load-bearing premise
The pre- and post-tests accurately measure the ability to specify computational goals to LLMs, and the control condition represents a class without prompt-focused instruction.
What would settle it
A larger replication study finding no difference in pre-to-post gains on the goal-specification test between the prompting lesson and code-tracing groups would indicate the intervention does not improve the targeted skill.
Figures
read the original abstract
Prompt-based programming, a new modality enabled by large language models (LLMs), allows users to express computational goals through natural language rather than traditional code. While this approach lowers barriers to entry, especially for non-CS learners, it does not eliminate the need for foundational CS skills. Learners often struggle to communicate their intent clearly to LLMs, resulting in vague or underspecified prompts. Prior work has documented the need for explicit prompting for both CS and non-CS learners. However, it remains less clear how such instruction can fit into busy classrooms or how much time is needed to produce meaningful gains. In this paper, we evaluated a 45-minute prompt-based programming intervention, consisting of a lesson with guided practice, against a business-as-usual CS lab activity (code tracing) of equal length, representing a class without prompt-focused instruction. We conducted a randomized controlled study with 55 engineering students. We found that students in the experimental condition improved more on average (though not significantly more) from pre- to post-test than the control group (+10.8 vs +1.1 percentage points) and showed significantly greater average gains in prompting self-efficacy (+35.4 vs +21.9 percentage points). Our results suggest it is likely that a brief intervention can improve learners' ability to specify computational goals to LLMs. However, the effect was modest, suggesting that prompting skills may require more time and practice to develop. We provide a lightweight lesson that requires no prior CS background and can be readily dropped into existing courses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports results from a randomized controlled study (n=55 engineering students) comparing a 45-minute prompt-based programming lesson with guided practice against an equal-length code-tracing control activity. The experimental group showed larger average pre-to-post gains on a test of specifying computational goals to LLMs (+10.8 vs +1.1 percentage points, not statistically significant) and significantly larger gains in prompting self-efficacy (+35.4 vs +21.9 percentage points). The authors conclude that a brief intervention can produce modest improvements in learners' ability to specify goals to LLMs but that more time and practice may be needed.
Significance. If the central empirical result holds after addressing measurement concerns, the work provides a practical, lightweight lesson plan that can be inserted into existing non-CS courses and supplies RCT evidence on the time investment required for prompting skill development. The equal-duration control condition and direct measurement of both performance and self-efficacy are strengths that allow clear comparison to business-as-usual instruction.
major comments (2)
- [Abstract / Results] Abstract and Results: The primary claim that the intervention improves ability to specify computational goals rests on the +10.8 vs +1.1 pp performance difference, yet this difference is reported as non-significant. With n=55 the study is under-powered for detecting modest effects, so the ability conclusion should be qualified or supported by additional analyses (e.g., effect-size reporting, per-item breakdowns, or power calculations).
- [Methods] Methods: No item-level description of the pre/post tests, scoring rubric, inter-rater reliability, or external validation (e.g., correlation with actual LLM prompt quality) is provided. Without these details it is impossible to confirm that the tests validly measure the target construct of goal-specification ability rather than general test-taking or domain knowledge.
minor comments (1)
- [Abstract] The abstract states the performance gain is 'not significantly more' while the self-efficacy gain is 'significantly greater'; these qualifiers should be repeated consistently in the discussion to avoid over-statement.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of results and methods.
read point-by-point responses
-
Referee: [Abstract / Results] Abstract and Results: The primary claim that the intervention improves ability to specify computational goals rests on the +10.8 vs +1.1 pp performance difference, yet this difference is reported as non-significant. With n=55 the study is under-powered for detecting modest effects, so the ability conclusion should be qualified or supported by additional analyses (e.g., effect-size reporting, per-item breakdowns, or power calculations).
Authors: We agree that the performance gain difference is non-significant and that the study is likely underpowered to detect modest effects. The manuscript already qualifies the conclusion by stating the difference is 'not significantly more' and describing the effect as 'modest,' with the explicit suggestion that more time and practice may be needed. To further address the concern, we will add effect-size reporting (Cohen's d for the performance and self-efficacy measures), a post-hoc power analysis, and a brief per-item breakdown of the test in the Results section. These additions will support the qualified claims without overstating the findings. revision: yes
-
Referee: [Methods] Methods: No item-level description of the pre/post tests, scoring rubric, inter-rater reliability, or external validation (e.g., correlation with actual LLM prompt quality) is provided. Without these details it is impossible to confirm that the tests validly measure the target construct of goal-specification ability rather than general test-taking or domain knowledge.
Authors: We acknowledge the absence of these methodological details in the current version. We will expand the Methods section to include sample pre/post test items, the full scoring rubric, inter-rater reliability statistics (two independent raters scored all responses), and a description of how the test was designed to isolate goal-specification ability. We will also note the lack of direct external validation against LLM-generated prompt quality as a limitation and discuss the test's alignment with the target construct. revision: yes
Circularity Check
No circularity: direct empirical RCT with measured outcomes
full rationale
The paper is a randomized controlled study reporting pre/post score changes from an educational intervention. No equations, derivations, fitted parameters, or self-citation chains exist that reduce any result to its own inputs by construction. Claims rest on direct measurement of test scores and self-efficacy, which are independent of any prior fitted quantities within the paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The pre- and post-tests validly measure learners' ability to specify computational goals to LLMs
- domain assumption Random assignment to conditions produced balanced groups
Reference graph
Works this paper leans on
-
[1]
Thom Baguley. 2009. Standardized or simple effect size: What should be reported? British journal of psychology100, 3 (2009), 603–617
2009
-
[2]
Giang Bui, Naaz Sibia, Angela Zavaleta Bernuy, Michael Liut, and Andrew Pe- tersen. 2023. Prior Programming Experience: A Persistent Performance Gap in CS1 and CS2. InProceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1(Toronto ON, Canada)(SIGCSE 2023). Association for Com- puting Machinery, New York, NY, USA, 889–895. doi:10...
-
[3]
Paul Denny, Juho Leinonen, James Prather, Andrew Luxton-Reilly, Thezyrie Amarouche, Brett A. Becker, and Brent N. Reeves. 2023. Promptly: Using Prompt Problems to Teach Learners How to Effectively Utilize AI Code Generators. arXiv:2307.16364 [cs.HC] https://arxiv.org/abs/2307.16364
-
[4]
Paul Denny, Juho Leinonen, James Prather, Andrew Luxton-Reilly, Thezyrie Amarouche, Brett A. Becker, and Brent N. Reeves. 2024. Prompt Problems: A New Programming Exercise for the Generative AI Era. InProceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1(Portland, OR, USA)(SIGCSE 2024). Association for Computing Machinery, New...
-
[5]
Rodrigo Silva Duran, Jan-Mikael Rybicki, Arto Hellas, and Sanna Suoranta. 2019. Towards a Common Instrument for Measuring Prior Programming Knowledge. InProceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education(Aberdeen, Scotland Uk)(ITiCSE ’19). Association for Computing Machinery, New York, NY, USA, 443–449. doi:1...
-
[6]
Janet Feigenspan, Christian Kästner, Jörg Liebig, Sven Apel, and Stefan Hanen- berg. 2012. Measuring programming experience. In2012 20th IEEE International Conference on Program Comprehension (ICPC). 73–82. doi:10.1109/ICPC.2012. 6240511
-
[7]
Molly Q Feldman and Carolyn Jane Anderson. 2024. Non-Expert Programmers in the Generative AI Future. InProceedings of the 3rd Annual Meeting of the Symposium on Human-Computer Interaction for Work(Newcastle upon Tyne, United Kingdom)(CHIWORK ’24). Association for Computing Machinery, New York, NY, USA, Article 15, 19 pages. doi:10.1145/3663384.3663393
-
[8]
Philip J. Guo. 2013. Online python tutor: embeddable web-based program visual- ization for cs education. InProceeding of the 44th ACM Technical Symposium on Computer Science Education(Denver, Colorado, USA)(SIGCSE ’13). Association for Computing Machinery, New York, NY, USA, 579–584. doi:10.1145/2445196. 2445368
-
[9]
Jinyoung Hur and Kathryn Cunningham. 2024. Profiling Conversational Pro- grammers at University: Insights into their Motivations and Goals from a Broad Sample of Non-Majors. InProceedings of the 2024 ACM Conference on Interna- tional Computing Education Research - Volume 1(Melbourne, VIC, Australia) (ICER ’24). Association for Computing Machinery, New Yor...
-
[10]
Ellen Jiang, Kristen Olson, Edwin Toh, Alejandra Molina, Aaron Donsbach, Michael Terry, and Carrie J Cai. 2022. PromptMaker: Prompt-based Prototyping with Large Language Models. InExtended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems(New Orleans, LA, USA)(CHI EA ’22). Association for Computing Machinery, New York, NY, USA, Ar...
-
[11]
Yu-Jeng Ju, Yi-Ching Wang, Shih-Chieh Lee, Cheng-Heng Liu, Jen-Hsuan Liu, Chih-Wei Yang, and Ching-Lin Hsieh. 2025. Developing the questionnaire of self-efficacy and needs in using large-language model-based AI services.Current Psychology(2025), 1–19
2025
-
[12]
Majeed Kazemitabaar, Xinying Hou, Austin Henley, Barbara Jane Ericson, David Weintrop, and Tovi Grossman. 2024. How Novices Use LLM-based Code Gen- erators to Solve CS1 Coding Tasks in a Self-Paced Learning Environment. In Proceedings of the 23rd Koli Calling International Conference on Computing Ed- ucation Research(Koli, Finland)(Koli Calling ’23). Asso...
-
[13]
Chris Kerslake, Paul Denny, IV Smith, David H., James Prather, Juho Leinonen, Andrew Luxton-Reilly, and Stephen MacNeil. 2024. Integrating Natural Language Prompting Tasks in Introductory Programming Courses. InProceedings of the 2024 on ACM Virtual Global Computing Education Conference V. 1(Virtual Event, NC, USA)(SIGCSE Virtual 2024). Association for Co...
-
[14]
Amy J. Ko, Robin Abraham, Laura Beckwith, Alan Blackwell, Margaret Burnett, Martin Erwig, Chris Scaffidi, Joseph Lawrance, Henry Lieberman, Brad Myers, Mary Beth Rosson, Gregg Rothermel, Mary Shaw, and Susan Wiedenbeck. 2011. The state of the art in end-user software engineering.ACM Comput. Surv.43, 3, Article 21 (April 2011), 44 pages. doi:10.1145/192264...
-
[15]
Liang, Melissa Lin, Nikitha Rao, and Brad A
Jenny T. Liang, Melissa Lin, Nikitha Rao, and Brad A. Myers. 2025. Prompts Are Programs Too! Understanding How Developers Build Software Containing Prompts.Proc. ACM Softw. Eng.2, FSE, Article FSE072 (June 2025), 24 pages. doi:10.1145/3729342
-
[16]
Leo S Lo. 2023. The CLEAR path: A framework for enhancing information literacy through prompt engineering.The Journal of Academic Librarianship49, 4 (2023), 102720
2023
-
[17]
Francesca Lucchetti, Zixuan Wu, Arjun Guha, Molly Q Feldman, and Carolyn Jane Anderson. 2025. Substance Beats Style: Why Beginning Students Fail to Code with LLMs. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 8541–8610
2025
-
[18]
Qianou Ma, Weirui Peng, Chenyang Yang, Hua Shen, Kenneth Koedinger, and Tongshuang Wu. 2025. What Should We Engineer in Prompts? Training Humans in Requirement-Driven LLM Use.ACM Trans. Comput.-Hum. Interact.(April 2025). doi:10.1145/3731756 Just Accepted
-
[20]
Reeves, Jaromir Savelka, IV Smith, David H., Sven Strickroth, and Daniel Zingaro
James Prather, Juho Leinonen, Natalie Kiesler, Jamie Gorson Benario, Sam Lau, Stephen MacNeil, Narges Norouzi, Simone Opel, Vee Pettit, Leo Porter, Brent N. Reeves, Jaromir Savelka, IV Smith, David H., Sven Strickroth, and Daniel Zingaro
-
[21]
Beyond the Hype: A Comprehensive Review of Current Trends in Genera- tive AI Research, Teaching Practices, and Tools. In2024 Working Group Reports on Innovation and Technology in Computer Science Education(Milan, Italy)(ITiCSE 2024). Association for Computing Machinery, New York, NY, USA, 300–338. doi:10.1145/3689187.3709614
-
[22]
Vennila Ramalingam, Deborah LaBelle, and Susan Wiedenbeck. 2004. Self-efficacy and mental models in learning to program. InProceedings of the 9th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education(Leeds, United Kingdom)(ITiCSE ’04). Association for Computing Machinery, New York, NY, USA, 171–175. doi:10.1145/1007996.1008042
-
[23]
Advait Sarkar. 2023. Will Code Remain a Relevant User Interface for End-User Pro- gramming with Generative AI Models?. InProceedings of the 2023 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Pro- gramming and Software(Cascais, Portugal)(Onward! 2023). Association for Com- puting Machinery, New York, NY, USA, 153–167. ...
-
[24]
Hari Subramonyam, Roy Pea, Christopher Pondoc, Maneesh Agrawala, and Colleen Seifert. 2024. Bridging the Gulf of Envisioning: Cognitive Challenges in Prompt Based Interactions with LLMs. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, New York, NY, USA, Articl...
-
[25]
Smith IV, Mounika Padala, Christine Alvarado, Jamie Gorson Benario, and Leo Porter
Annapurna Vadaparty, Daniel Zingaro, David H. Smith IV, Mounika Padala, Christine Alvarado, Jamie Gorson Benario, and Leo Porter. 2024. CS1-LLM: Integrating LLMs into CS1 Instruction. InProceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1(Milan, Italy)(ITiCSE 2024). Association for Computing Machinery, New York, NY, USA,...
-
[26]
April Y. Wang, Ryan Mitts, Philip J. Guo, and Parmit K. Chilana. 2018. Mismatch of Expectations: How Modern Learning Resources Fail Conversational Programmers. InProceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada)(CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–13. doi:10.1145/3173574.3174085
-
[27]
Ronald L Wasserstein and Nicole A Lazar. 2016. The ASA statement on p-values: context, process, and purpose. 129–133 pages
2016
-
[28]
Ronald L Wasserstein, Allen L Schirm, and Nicole A Lazar. 2019. Moving to a world beyond “p< 0.05”. 19 pages
2019
-
[29]
2005.Understanding by design
Grant P Wiggins and Jay McTighe. 2005.Understanding by design. Ascd
2005
-
[30]
Bodo Winter. 2013. Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499 [cs.CL] https://arxiv.org/abs/1308.5499
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[31]
Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J Cai. 2022. PromptChainer: Chaining Large Language Model Prompts through Visual Programming. InExtended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems(New Orleans, LA, USA) (CHI EA ’22). Association for Computing Machinery, New Y...
-
[32]
Zamfirescu-Pereira, Richmond Y
J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang
-
[33]
Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 437, 21 pages. doi:10.1145/3544548. 3581388
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.