Autonomous LLM-generated Feedback for Student Exercises in Introductory Software Engineering Courses
Pith reviewed 2026-05-09 23:31 UTC · model grok-4.3
The pith
NAILA uses large language models to deliver 24/7 feedback for student exercises in introductory software engineering courses by comparing submissions to model solutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that NAILA can autonomously generate feedback for student exercises in introductory software engineering by processing open-format documents against teacher model solutions through specialized LLM prompt templates. The accompanying empirical study with over 900 active students investigates the motivations driving students to adopt or reject the tool, measures user acceptance via perceived usefulness, ease of use, and subjective learning progress, tracks engagement frequency and consistency, and compares the impact of AI feedback on academic performance to that of human feedback.
What carries the argument
NAILA, an autonomous feedback system that applies large language models to evaluate student submissions in open document formats against predefined model solutions using specialized prompt templates.
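The mechanism described above (fill a prompt template with the model solution and the extracted student text, then hand the result to an LLM) can be sketched as follows. This is a minimal illustration, not NAILA's implementation: the template wording, the name `FEEDBACK_TEMPLATE`, and the function `build_feedback_prompt` are hypothetical, and the LLM call itself is left out because the summary does not say which model or API is used.

```python
from string import Template

# Hypothetical template; the paper's actual prompt wording is not given here.
FEEDBACK_TEMPLATE = Template(
    "You are a tutor for an introductory software engineering course.\n"
    "Exercise: $exercise\n"
    "Model solution:\n$model_solution\n"
    "Student submission:\n$submission\n"
    "Compare the submission with the model solution and give constructive feedback."
)


def build_feedback_prompt(exercise: str, model_solution: str, submission: str) -> str:
    """Fill the template with the exercise title, model solution, and submission text."""
    return FEEDBACK_TEMPLATE.substitute(
        exercise=exercise.strip(),
        model_solution=model_solution.strip(),
        submission=submission.strip(),
    )


# Illustrative inputs only; the submission text is assumed to be already
# extracted from the student's document.
prompt = build_feedback_prompt(
    exercise="Use case diagram for a library system",
    model_solution="Actors: Member, Librarian. Use cases: Borrow book, Return book.",
    submission="Actors: Member. Use cases: Borrow book.",
)
print(prompt)
```

Keeping the template separate from the filling step means a teacher-defined model solution can be swapped per exercise without touching the prompt wording.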
If this is right
- Teachers can provide continuous feedback without being limited by time or staff availability.
- Students receive immediate evaluations of their work at any hour.
- Direct comparisons become possible between the effects of AI-generated and human feedback on student outcomes.
- Data on student motivations and engagement can guide improvements in educational tools.
Where Pith is reading between the lines
- If validated, the method could be adapted for feedback in other technical courses with similar exercise formats.
- Long-term use might reveal whether repeated AI feedback strengthens or weakens students' independent problem-solving skills.
- Hybrid models combining AI for routine checks with human review for complex cases could emerge as a practical extension.
Load-bearing premise
The LLM outputs from the specialized prompt templates will be accurate enough, unbiased, and pedagogically appropriate to serve as reliable feedback without misleading students or introducing systematic errors.
What would settle it
Observing that students relying on NAILA feedback achieve lower exam scores or exhibit more misconceptions in their work than those receiving equivalent human feedback would falsify the tool's value as a substitute or supplement.
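One way such a comparison could be run is a Welch two-sample t-test on exam scores from the two feedback groups. The sketch below is an assumption about a plausible analysis, not the paper's method, and the scores are toy numbers chosen for illustration rather than study data. The function name `welch_t` is hypothetical.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    # Per-group contribution to the squared standard error (sample variance / n).
    se_a = statistics.variance(group_a) / n_a
    se_b = statistics.variance(group_b) / n_b
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Toy scores, not data from the study.
human_feedback_scores = [2, 4, 6]
ai_feedback_scores = [1, 2, 3]
t_stat, dof = welch_t(human_feedback_scores, ai_feedback_scores)
print(f"t = {t_stat:.3f}, df = {dof:.3f}")
```

A clearly positive t with adequate degrees of freedom would point toward the human-feedback group scoring higher. Turning t into a p-value needs a t-distribution, which the standard library does not provide.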
Original abstract
Introductory Software Engineering (SE) courses face rapidly increasing student enrollment numbers, participants with diverse backgrounds, and the influence of Generative AI (GenAI) solutions. High teacher-to-student ratios often make providing timely, high-quality, and personalized feedback a significant challenge for educators. To address these challenges, we introduce NAILA, a tool that provides 24/7 autonomous feedback for student exercises. Utilizing GenAI in the form of modern LLMs, NAILA processes student solutions provided in open document formats, evaluating them against teacher-defined model solutions through specialized prompt templates. We conducted an empirical study involving 900+ active students at the University of Duisburg-Essen to assess four main research questions investigating (1) the underlying motivations that drive students to either adopt or reject NAILA, (2) user acceptance by measuring perceived usefulness and ease of use alongside subjective learning progress, (3) how often and how consistently students engage with NAILA, and (4) how using NAILA to receive AI feedback affects academic performance compared to human feedback.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NAILA, a tool that uses LLMs to deliver 24/7 autonomous feedback on student exercises in introductory software engineering courses. Student submissions in open document formats are evaluated against teacher-defined model solutions via specialized prompt templates. The authors describe an empirical study with 900+ active students at the University of Duisburg-Essen designed to address four research questions on adoption motivations, user acceptance (usefulness, ease of use, perceived learning progress), engagement frequency/consistency, and academic performance effects relative to human feedback.
Significance. If the missing empirical results were supplied and demonstrated reliable, unbiased feedback that improves engagement or performance without systematic errors, the work could meaningfully address scalability challenges in large-enrollment SE courses by showing how GenAI can supplement human feedback. The tool architecture and study design are clearly motivated by real enrollment pressures and GenAI availability, but the absence of any data, analyses, or findings on the four RQs prevents assessment of actual impact or pedagogical value.
major comments (2)
- [Abstract] Abstract and introduction: The manuscript states that an empirical study with 900+ students was conducted to assess the four research questions on motivations, acceptance, engagement, and performance impact, yet no results section, tables, figures, statistical comparisons, or qualitative findings are present. This is load-bearing because the central claim is that the study evaluates NAILA's effects versus human feedback; without the outcomes the assessment cannot be evaluated.
- [Introduction / Tool Architecture] Study design description: The weakest assumption (that LLM outputs from the specialized prompt templates will be accurate, unbiased, and pedagogically valuable enough to serve as feedback) is not tested or bounded in the reported material. No error analysis, inter-rater agreement with human graders, or failure-case examples are supplied to support the claim that the tool can be deployed without misleading students.
minor comments (1)
- [Tool Description] Notation for the prompt templates and document processing pipeline could be clarified with a diagram or pseudocode to make the architecture reproducible.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. The points raised correctly identify gaps in the current version, and we will revise the paper to address them fully.
Point-by-point responses
- Referee: [Abstract] Abstract and introduction: The manuscript states that an empirical study with 900+ students was conducted to assess the four research questions on motivations, acceptance, engagement, and performance impact, yet no results section, tables, figures, statistical comparisons, or qualitative findings are present. This is load-bearing because the central claim is that the study evaluates NAILA's effects versus human feedback; without the outcomes the assessment cannot be evaluated.
  Authors: We agree that the absence of the empirical results is a critical omission. The study with 900+ students was conducted and data were collected for all four research questions, but these findings were not included in the submitted manuscript. In the revised version we will add a complete Results section containing quantitative and qualitative analyses, tables, figures, statistical comparisons (including performance effects relative to human feedback), and student acceptance metrics to enable full evaluation of the claims.
  Revision: yes
- Referee: [Introduction / Tool Architecture] Study design description: The weakest assumption (that LLM outputs from the specialized prompt templates will be accurate, unbiased, and pedagogically valuable enough to serve as feedback) is not tested or bounded in the reported material. No error analysis, inter-rater agreement with human graders, or failure-case examples are supplied to support the claim that the tool can be deployed without misleading students.
  Authors: The referee is correct that the manuscript does not provide empirical bounds on feedback quality. We will add a dedicated subsection on LLM feedback validation. This will report an error analysis performed on a sample of outputs, inter-rater agreement statistics (e.g., Cohen’s kappa) between NAILA feedback and human expert ratings, and representative failure cases together with the prompt-engineering mitigations used. These additions will directly address the concern about potential misleading feedback.
  Revision: yes
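For readers unfamiliar with the agreement statistic named in the rebuttal, Cohen's kappa corrects raw agreement between two raters for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences. The labels and the function name `cohens_kappa` are illustrative assumptions and are not taken from the paper or the rebuttal.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on four submissions (not study data).
tool_verdicts = ["ok", "ok", "bad", "bad"]
human_verdicts = ["ok", "bad", "bad", "bad"]
kappa = cohens_kappa(tool_verdicts, human_verdicts)
print(kappa)  # 0.5
```

Here the raters agree on three of four items (p_o = 0.75), chance agreement is 0.5, and kappa is therefore 0.5.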
Circularity Check
No circularity in tool description and empirical study outline
The paper introduces NAILA as a tool using LLMs and specialized prompts to provide feedback on student solutions against model answers, then outlines an empirical study with 900+ students to assess four research questions on motivations, acceptance, engagement, and performance impacts versus human feedback. No equations, derivations, fitted parameters, predictions, or self-citations appear in the provided text. The central claims are descriptive and empirical rather than derived from first principles, so no step reduces to its inputs by construction. The work is self-contained as a system description plus study design.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Modern LLMs can reliably evaluate open-ended student solutions in software engineering against teacher-provided model solutions when guided by specialized prompt templates.
invented entities (1)
- NAILA: no independent evidence