
arxiv: 2604.20803 · v1 · submitted 2026-04-22 · 💻 cs.SE

Autonomous LLM-generated Feedback for Student Exercises in Introductory Software Engineering Courses

Pith reviewed 2026-05-09 23:31 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM feedback, software engineering education, autonomous feedback, student engagement, academic performance, user acceptance, generative AI, empirical study

The pith

NAILA uses large language models to deliver 24/7 feedback for student exercises in introductory software engineering courses by comparing submissions to model solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Introductory software engineering courses often have high enrollments that make it hard for instructors to give timely and personalized feedback to every student. The paper introduces NAILA, a tool that employs modern large language models to automatically process student solutions in open document formats. NAILA evaluates these solutions against teacher-defined model solutions using specialized prompt templates to generate the feedback. Researchers then studied its use among more than 900 students to understand motivations for adoption or rejection, perceived usefulness and ease of use, engagement patterns, and any influence on academic performance relative to human feedback. This approach could help address the challenge of scaling quality feedback as class sizes grow and generative AI tools become widespread in education.

Core claim

The central discovery is that NAILA can autonomously generate feedback for student exercises in introductory software engineering by processing open-format documents against teacher model solutions through specialized LLM prompt templates. The accompanying empirical study with over 900 active students investigates the motivations driving students to adopt or reject the tool, measures user acceptance via perceived usefulness, ease of use, and subjective learning progress, tracks engagement frequency and consistency, and compares the impact of AI feedback on academic performance to that of human feedback.

What carries the argument

NAILA, an autonomous feedback system that applies large language models to evaluate student submissions in open document formats against predefined model solutions using specialized prompt templates.
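
The comparison step described above can be sketched roughly as follows. This is a minimal illustration, not NAILA's actual implementation: the template wording, the name `FEEDBACK_PROMPT`, and the function `build_prompt` are all assumptions made for this example, and the call to an LLM is deliberately left out because the source does not say which model or API is used.

```python
from string import Template

# Hypothetical prompt template. The paper's real template wording is not
# reproduced here; this only shows the general shape of such a template.
FEEDBACK_PROMPT = Template(
    "You are a tutor in an introductory software engineering course.\n"
    "Exercise question:\n$question\n\n"
    "Teacher's model solution:\n$model_solution\n\n"
    "Student's solution:\n$student_solution\n\n"
    "Compare the student's solution with the model solution and explain "
    "what is correct, what is missing, and what is wrong."
)


def build_prompt(question: str, model_solution: str, student_solution: str) -> str:
    """Fill the (hypothetical) template with one exercise and one submission."""
    return FEEDBACK_PROMPT.substitute(
        question=question,
        model_solution=model_solution,
        student_solution=student_solution,
    )


if __name__ == "__main__":
    # Placeholder texts, invented for illustration only.
    print(build_prompt(
        "Name two types of requirements.",
        "Functional and quality (non-functional) requirements.",
        "Functional requirements and constraints.",
    ))
```

The resulting string would then be sent to whichever LLM the tool uses. Keeping the template separate from the exercise data is one plausible way for teachers to define model solutions without touching the prompt logic.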

If this is right

  • Teachers can provide continuous feedback without being limited by time or staff availability.
  • Students receive immediate evaluations of their work at any hour.
  • Direct comparisons become possible between the effects of AI-generated and human feedback on student outcomes.
  • Data on student motivations and engagement can guide improvements in educational tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If validated, the method could be adapted for feedback in other technical courses with similar exercise formats.
  • Long-term use might reveal whether repeated AI feedback strengthens or weakens students' independent problem-solving skills.
  • Hybrid models combining AI for routine checks with human review for complex cases could emerge as a practical extension.

Load-bearing premise

The LLM outputs from the specialized prompt templates will be accurate enough, unbiased, and pedagogically appropriate to serve as reliable feedback without misleading students or introducing systematic errors.

What would settle it

Observing that students relying on NAILA feedback achieve lower exam scores or exhibit more misconceptions in their work than those receiving equivalent human feedback would falsify the tool's value as a substitute or supplement.
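
One way such a comparison could be run is a Welch's t-test on exam scores of the two groups. The sketch below is an assumption about how a reader might check this, not the analysis used in the paper; the function name `welch_t` and the scores in the usage example are invented placeholders, not study data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t, df) of Welch's unequal-variance t-test for two samples."""
    # Squared standard error contributed by each group.
    va = variance(group_a) / len(group_a)
    vb = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (
        va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1)
    )
    return t, df


if __name__ == "__main__":
    # Made-up placeholder scores (percent), NOT data from the study.
    ai_feedback = [60.0, 70.0, 80.0]
    human_feedback = [50.0, 60.0, 70.0]
    print(welch_t(ai_feedback, human_feedback))
```

A clearly negative t for the AI-feedback group would point toward the falsifying outcome described above. Because use of NAILA was optional, any such comparison would also have to account for self-selection between the groups.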

Figures

Figures reproduced from arXiv: 2604.20803 by Andreas Metzger.

Figure 1. Total number of participants in UDE’s introductory …
Figure 2. Degree programs in UDE’s introductory SE course
Figure 3. Students’ AI experience levels in UDE’s introductory …
Figure 4. Conceptual Architecture and Data Flow of …
Figure 5. Example of an Exercise Question
    The prompt for the above example is shown in …
Figure 6. Prompt for Example Exercise Question
    … a student participated in the introductory test at the beginning of the semester. The use of NAILA was not mandatory, so the students could decide for themselves whether they used AI-based feedback for their exercises or not. Out of the 843 active students, 314 (37%) opted to use NAILA, i.e., we counted them if they used NAILA at least once …
Figure 7. Summary of RQ2 Results (N = 68)
    We quantify NAILA usage (NU) of a student by introducing two distinct metrics. NUC (Confirm) captures how well students performed when deciding they don’t need more than one round of AI feedback for an exercise. NUR (Remedy) captures how much students’ exercise performance improves when they repetitively use AI feedback for an exercise. These metrics are computed as foll…
Figure 8. Breakdown of RQ2 Results (N = 68)
    …matory” feedback (NUC), 20 (6%) always used “remedial” feedback (NUR), and the remaining 155 (49%) used both …
Figure 9. Histograms of Students’ NAILA usage (N_NUC = 294, N_NUR = 175)
    3.5 RQ4: How does NAILA affect learning? RQ4 Setup: To achieve a quantitative assessment of how NAILA may affect learning, we measured the students’ performance in the exam taken at the end of the semester (SP). The exam took 60 minutes and consisted of multiple-choice questions (ca. 30%) and free-text questions. As the use of NAILA was optional…
Figure 10. Histogram of Students’ Performance (SP) (N = 670, mean SP = 70%, SP_min = 13%, SP_max = 99%)
Figure 11. Boxplots of Students’ Performance (SP) (N = 760)
    TABLE IV: Distribution by Level (N = 670)
    Level    WE (n)   WE (%)   EM (n)   EM (%)
    0           39     5.8%      199    29.7%
    1          114    17.0%      200    29.9%
    2          176    26.3%      158    23.6%
    3          341    50.9%      113    16.9%
    … represented by the volume of Written Exercises completed (WE) with a β2 = 6.58***. Crucially, regarding the effectiveness of NAILA, the feedback metric (NUC) emerged as a significant positive predictor of final perform…
Original abstract

Introductory Software Engineering (SE) courses face rapidly increasing student enrollment numbers, participants with diverse backgrounds, and the influence of Generative AI (GenAI) solutions. High student-to-teacher ratios often make providing timely, high-quality, and personalized feedback a significant challenge for educators. To address these challenges, we introduce NAILA, a tool that provides 24/7 autonomous feedback for student exercises. Utilizing GenAI in the form of modern LLMs, NAILA processes student solutions provided in open document formats, evaluating them against teacher-defined model solutions through specialized prompt templates. We conducted an empirical study involving 900+ active students at the University of Duisburg-Essen to assess four main research questions investigating (1) the underlying motivations that drive students to either adopt or reject NAILA, (2) user acceptance by measuring perceived usefulness and ease of use alongside subjective learning progress, (3) how often and how consistently students engage with NAILA, and (4) how using NAILA to receive AI feedback affects academic performance compared to human feedback.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces NAILA, a tool that uses LLMs to deliver 24/7 autonomous feedback on student exercises in introductory software engineering courses. Student submissions in open document formats are evaluated against teacher-defined model solutions via specialized prompt templates. The authors describe an empirical study with 900+ active students at the University of Duisburg-Essen designed to address four research questions on adoption motivations, user acceptance (usefulness, ease of use, perceived learning progress), engagement frequency/consistency, and academic performance effects relative to human feedback.

Significance. If the missing empirical results were supplied and demonstrated reliable, unbiased feedback that improves engagement or performance without systematic errors, the work could meaningfully address scalability challenges in large-enrollment SE courses by showing how GenAI can supplement human feedback. The tool architecture and study design are clearly motivated by real enrollment pressures and GenAI availability, but the absence of any data, analyses, or findings on the four RQs prevents assessment of actual impact or pedagogical value.

major comments (2)
  1. [Abstract] Abstract and introduction: The manuscript states that an empirical study with 900+ students was conducted to assess the four research questions on motivations, acceptance, engagement, and performance impact, yet no results section, tables, figures, statistical comparisons, or qualitative findings are present. This is load-bearing because the central claim is that the study evaluates NAILA's effects versus human feedback; without the outcomes the assessment cannot be evaluated.
  2. [Introduction / Tool Architecture] Study design description: The weakest assumption—that LLM outputs from the specialized prompt templates will be accurate, unbiased, and pedagogically valuable enough to serve as feedback—is not tested or bounded in the reported material. No error analysis, inter-rater agreement with human graders, or failure-case examples are supplied to support the claim that the tool can be deployed without misleading students.
minor comments (1)
  1. [Tool Description] Notation for the prompt templates and document processing pipeline could be clarified with a diagram or pseudocode to make the architecture reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. The points raised correctly identify gaps in the current version, and we will revise the paper to address them fully.

  1. Referee: [Abstract] Abstract and introduction: The manuscript states that an empirical study with 900+ students was conducted to assess the four research questions on motivations, acceptance, engagement, and performance impact, yet no results section, tables, figures, statistical comparisons, or qualitative findings are present. This is load-bearing because the central claim is that the study evaluates NAILA's effects versus human feedback; without the outcomes the assessment cannot be evaluated.

    Authors: We agree that the absence of the empirical results is a critical omission. The study with 900+ students was conducted and data were collected for all four research questions, but these findings were not included in the submitted manuscript. In the revised version we will add a complete Results section containing quantitative and qualitative analyses, tables, figures, statistical comparisons (including performance effects relative to human feedback), and student acceptance metrics to enable full evaluation of the claims.
    revision: yes

  2. Referee: [Introduction / Tool Architecture] Study design description: The weakest assumption—that LLM outputs from the specialized prompt templates will be accurate, unbiased, and pedagogically valuable enough to serve as feedback—is not tested or bounded in the reported material. No error analysis, inter-rater agreement with human graders, or failure-case examples are supplied to support the claim that the tool can be deployed without misleading students.

    Authors: The referee is correct that the manuscript does not provide empirical bounds on feedback quality. We will add a dedicated subsection on LLM feedback validation. This will report an error analysis performed on a sample of outputs, inter-rater agreement statistics (e.g., Cohen’s kappa) between NAILA feedback and human expert ratings, and representative failure cases together with the prompt-engineering mitigations used. These additions will directly address the concern about potential misleading feedback.
    revision: yes
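
For readers unfamiliar with the agreement statistic named in the rebuttal, Cohen's kappa is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two raters and p_e the agreement expected by chance from their label frequencies. The sketch below is a plain-Python illustration under the assumption that both NAILA's feedback and the human ratings have first been coded into categorical verdicts per exercise; the function name `cohens_kappa` and the labels in the usage example are invented for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Invented verdicts for four submissions, not data from the study.
    tool = ["ok", "ok", "bad", "bad"]
    human = ["ok", "bad", "bad", "bad"]
    print(cohens_kappa(tool, human))
```

In this toy example the raters agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5.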

Circularity Check

0 steps flagged

No circularity in tool description and empirical study outline


The paper introduces NAILA as a tool using LLMs and specialized prompts to provide feedback on student solutions against model answers, then outlines an empirical study with 900+ students to assess four research questions on motivations, acceptance, engagement, and performance impacts versus human feedback. No equations, derivations, fitted parameters, predictions, or self-citations appear in the provided text. The central claims are descriptive and empirical rather than derived from first principles, so no step reduces to its inputs by construction. The work is self-contained as a system description plus study design.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested premise that prompt-engineered LLMs can produce educationally sound feedback; the only invented entity is the NAILA tool itself.

axioms (1)
  • domain assumption: Modern LLMs can reliably evaluate open-ended student solutions in software engineering against teacher-provided model solutions when guided by specialized prompt templates
    This assumption underpins both the tool's design and any claim of usefulness versus human feedback.
invented entities (1)
  • NAILA (no independent evidence)
    purpose: Autonomous LLM-based feedback system for student exercises
    Newly introduced tool whose value depends on the domain assumption above.

pith-pipeline@v0.9.0 · 5476 in / 1293 out tokens · 52414 ms · 2026-05-09T23:31:23.247535+00:00 · methodology

