pith. sign in

arxiv: 2605.21614 · v1 · pith:VN3NTATYnew · submitted 2026-05-20 · 💻 cs.HC · cs.LG

Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education

Pith reviewed 2026-05-22 09:19 UTC · model grok-4.3

classification 💻 cs.HC cs.LG
keywords LLMself-explanationsautomated assessmentprogramming educationsemantic similaritybinary classificationworked examplesstudent responses
0
0 comments X

The pith

LLMs offer a more effective approach than semantic similarity for binary scoring of student self-explanations in programming.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large language models outperform semantic similarity methods when automatically judging the correctness of student self-explanations of programming steps. It frames this assessment as a binary classification task and stresses the need for datasets that feature balanced correct/incorrect examples along with domain-specific labels. A sympathetic reader would care because self-explanations strengthen learning from worked examples, yet manual review does not scale to large classes, making reliable automation essential for wider use of this teaching technique. The work tests if recent LLM advances have displaced older similarity-based scoring.

Core claim

The paper conducts a rigorous comparison of LLMs versus semantic similarity methods for automated scoring of student self-explanations, framed as a binary classification task that requires quality datasets with balanced class distributions and domain-specific labels to produce meaningful results.

What carries the argument

Binary classification of each student self-explanation as correct or incorrect, performed either by prompting an LLM or by measuring semantic similarity to an expert reference explanation.

If this is right

  • Automated scoring can support wider adoption of self-explanation activities inside large programming courses.
  • LLM-based feedback becomes feasible for real-time use during worked-example study.
  • Future evaluations of scoring techniques will require datasets that maintain balanced classes and domain-specific labels.
  • Self-explanation combined with worked examples can be deployed at greater scale without proportional increases in instructor effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar head-to-head tests could be run in other subjects such as mathematics or science to see whether LLMs retain their edge outside programming.
  • Hybrid scoring systems that combine LLM reasoning with similarity checks might improve robustness on edge cases.
  • Moving beyond binary labels to capture degrees of correctness or specific misconceptions could build on the same comparison method.

Load-bearing premise

Student self-explanations can be meaningfully assessed through a simple binary correct/incorrect classification, and suitable balanced domain-specific datasets exist for fair comparison.

What would settle it

A new balanced dataset of student self-explanations where semantic similarity methods achieve higher classification accuracy than LLMs would show that the LLM approach is not more effective.

Figures

Figures reproduced from arXiv: 2605.21614 by Arun-Balajiee Lekshmi-Narayanan, Mohammad Hassany, Peter Brusilovsky.

Figure 1
Figure 1. Figure 1: Prompting LLM to evaluate students’ explanations. “OR” notation is used in this to indicate only one of [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The AI-generated explanations remain similar [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Aritifical Generation to augment the original dataset with negative (incorrect) code examples [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

Worked examples are step-by-step solutions to problems in a specific domain, offered to students to acquire domain-specific problem-solving skills. The effectiveness of worked examples could be enhanced by combining them with self-explanations, which ask students to explain rather than passively study each problem-solving step. The main challenge of this approach is assessing the correctness of the student's explanations. In the prevailing approach, student explanations are judged by their semantic similarity to an instructor's or domain expert's explanation. Given recent advances in LLM-based automated scoring, it remains unclear whether semantic similarity methods are still the most effective technique to automatically score textual student responses like essays or code explanations. Comparing these methods also requires quality datasets that offer distinctive features such as balanced class distributions and domain-specific labeled data for automated scoring tasks. In this paper, we present a rigorous comparison between LLMs and semantic similarity used for automated scoring, framed as a binary classification task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript explores the effectiveness of large language models (LLMs) versus semantic similarity methods for automated assessment of student self-explanations in programming education. It frames the scoring task as binary classification (correct/incorrect) and notes the requirement for quality datasets featuring balanced class distributions and domain-specific labels to enable such comparisons.

Significance. If executed with appropriate datasets and yielding clear performance differences, the comparison could inform practical choices for scalable automated scoring in programming courses. The work targets a real bottleneck in worked-example-based instruction. However, the absence of reported results, metrics, or dataset descriptions in the provided text limits assessment of whether any concrete advance is demonstrated.

major comments (2)
  1. [Abstract] Abstract: The text describes the planned comparison and binary classification framing but supplies no actual results, performance numbers, dataset details, or prompting methods. This prevents evaluation of whether the data or methods support the central claim of a rigorous head-to-head comparison.
  2. [Abstract] Abstract (binary classification framing): Self-explanations in programming frequently contain partial correctness, multiple valid aspects, or subtle misconceptions that resist clean binary labeling. Without reported inter-rater reliability, multi-label alternatives, or correlation to learning outcomes, any performance delta between LLMs and semantic similarity could be an artifact of label granularity rather than genuine methodological superiority.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief concrete example of a domain-specific labeled self-explanation to illustrate the balanced-class and domain-specific requirements mentioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript comparing LLMs and semantic similarity for automated assessment of student self-explanations in programming education. We address each major comment point by point below, indicating revisions where made to strengthen the paper.

read point-by-point responses
  1. Referee: The text describes the planned comparison and binary classification framing but supplies no actual results, performance numbers, dataset details, or prompting methods. This prevents evaluation of whether the data or methods support the central claim of a rigorous head-to-head comparison.

    Authors: The full manuscript includes a dedicated dataset section describing the balanced class distributions and domain-specific labels collected from programming education self-explanations, a methods section detailing the LLM prompting strategies, and a results section reporting concrete performance metrics (accuracy, F1, etc.) from the comparison. To make these elements more immediately visible, we have revised the abstract to summarize the dataset characteristics, prompting approach, and primary performance findings. revision: yes

  2. Referee: Self-explanations in programming frequently contain partial correctness, multiple valid aspects, or subtle misconceptions that resist clean binary labeling. Without reported inter-rater reliability, multi-label alternatives, or correlation to learning outcomes, any performance delta between LLMs and semantic similarity could be an artifact of label granularity rather than genuine methodological superiority.

    Authors: We acknowledge that binary labels cannot capture every nuance of student explanations. Our dataset labels were produced by domain experts using a protocol focused on whether the core conceptual element was correctly explained; we have now added the inter-rater reliability statistics and labeling guidelines to the methods section. We have also inserted a limitations paragraph discussing the binary framing, outlining multi-label alternatives for future work, and noting that direct correlation with learning outcomes lies outside the scope of this methodological comparison. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison without derivations or self-referential reductions

full rationale

This is a standard empirical study that compares LLMs against semantic similarity baselines for binary classification of student self-explanations using external datasets and models. The paper contains no equations, parameter-fitting steps, uniqueness theorems, or derivation chains that could reduce to their own inputs by construction. All performance claims rest on reported experimental results against held-out data rather than any self-definitional or self-citation load-bearing structure, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that binary classification captures explanation quality and on the existence of suitable balanced, domain-specific labeled datasets; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Binary classification is sufficient to represent correctness of student self-explanations
    The abstract frames the automated scoring task as binary classification without discussing limitations of this reduction.

pith-pipeline@v0.9.0 · 5699 in / 1201 out tokens · 36873 ms · 2026-05-22T09:19:07.900071+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · 4 internal anchors

  1. [1]

    KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks

    KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks , author=. arXiv preprint arXiv:2601.06633 , year=

  2. [2]

    The International FLAIRS Conference Proceedings , author=

    SelfCode 2.0: An Annotated Corpus of Student and Expert Line-by-Line Explanations of Code Examples for Automated Assessment , volume=. The International FLAIRS Conference Proceedings , author=. 2025 , month=. doi:10.32473/flairs.38.1.138727 , abstractNote=

  3. [3]

    From generation to judg- ment: Opportunities and challenges of llm-as-a-judge,

    From generation to judgment: Opportunities and challenges of llm-as-a-judge , author=. arXiv preprint arXiv:2411.16594 , year=

  4. [4]

    Efficient Estimation of Word Representations in Vector Space

    Efficient estimation of word representations in vector space , author=. arXiv preprint arXiv:1301.3781 , year=

  5. [5]

    FastText.zip: Compressing text classification models

    FastText.zip: Compressing text classification models , author=. arXiv preprint arXiv:1612.03651 , year=

  6. [6]

    In: Proceedings of the 2014 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP)

    Pennington, Jeffrey and Socher, Richard and Manning, Christopher. G lo V e: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ). 2014. doi:10.3115/v1/D14-1162

  7. [7]

    Every Answer Matters: Evaluating Commonsense with Probabilistic Measures

    Cheng, Qi and Boratko, Michael and Yelugam, Pranay Kumar and O ' Gorman, Tim and Singh, Nalini and McCallum, Andrew and Li, Xiang. Every Answer Matters: Evaluating Commonsense with Probabilistic Measures. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.29

  8. [8]

    Bertscore: Evaluating text generation with

    Zhang, Tianyi and Kishore, Varsha and Wu, Felix and Weinberger, Kilian Q and Artzi, Yoav , journal=. Bertscore: Evaluating text generation with

  9. [9]

    Orienteering in an Information Land- scape:HowInformationSeekersGetfromHeretoThere

    Novikova, Jekaterina and Du s ek, Ond r ej and Cercas Curry, Amanda and Rieser, Verena. Why We Need New Evaluation Metrics for NLG. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. doi:10.18653/v1/D17-1238

  10. [10]

    How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

    Liu, Chia-Wei and Lowe, Ryan and Serban, Iulian and Noseworthy, Mike and Charlin, Laurent and Pineau, Joelle. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1230

  11. [11]

    ACL , year=

    Bleu: a Method for Automatic Evaluation of Machine Translation , author=. ACL , year=

  12. [12]

    Can LLM be a Personalized Judge?

    Dong, Yijiang River and Hu, Tiancheng and Collier, Nigel. Can LLM be a Personalized Judge?. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.592

  13. [13]

    1999 , publisher=

    Foundations of statistical natural language processing , author=. 1999 , publisher=

  14. [14]

    Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005) , pages=

    Joint parsing and semantic role labeling , author=. Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005) , pages=. 2005 , organization=

  15. [15]

    The Twelfth International Conference on Learning Representations , year=

    Generative Judge for Evaluating Alignment , author=. The Twelfth International Conference on Learning Representations , year=

  16. [16]

    Is LLM -as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

    Raina, Vyas and Liusie, Adian and Gales, Mark. Is LLM -as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.427

  17. [17]

    Open Source Language Models Can Provide Feedback: Evaluating

    Koutcheme, Charles and Dainese, Nicola and Sarsa, Sami and Hellas, Arto and Leinonen, Juho and Denny, Paul , booktitle=. Open Source Language Models Can Provide Feedback: Evaluating

  18. [18]

    Benchmarking Cognitive Biases in Large Language Models as Evaluators

    Koo, Ryan and Lee, Minhwa and Raheja, Vipul and Park, Jong Inn and Kim, Zae Myung and Kang, Dongyeop. Benchmarking Cognitive Biases in Large Language Models as Evaluators. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.29

  19. [19]

    arXiv preprint arXiv:2402.01580 , year=

    Generative AI for Education (GAIED): Advances, Opportunities, and Challenges , author=. arXiv preprint arXiv:2402.01580 , year=

  20. [20]

    Smith IV au2, Max Fowler, James Prather, Brett A

    Explaining Code with a Purpose: An Integrated Approach for Developing Code Comprehension and Prompting Skills , author=. arXiv preprint arXiv:2403.06050 , year=

  21. [21]

    2014 , publisher=

    Evaluating the quality of learning: The SOLO taxonomy (Structure of the Observed Learning Outcome) , author=. 2014 , publisher=

  22. [22]

    Journal of Computer Assisted Learning , volume=

    Automating autograding: Large language models as test suite generators for introductory programming , author=. Journal of Computer Assisted Learning , volume=. 2025 , publisher=

  23. [23]

    Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing , pages =

    Hassany, Mohammad and Ke, Jiaze and Brusilovsky, Peter and Lekshmi Narayanan, Arun Balajiee and Akhuseyinoglu, Kamil , title =. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing , pages =. 2024 , isbn =. doi:10.1145/3605098.3636160 , abstract =

  24. [24]

    Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

    Codetailor: Llm-powered personalized parsons puzzles for engaging support while learning programming , author=. Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

  25. [25]

    Generating Effective Distractors for Introductory Programming Challenges:

    Hassany, Mohammad and Brusilovsky, Peter and Savelka, Jaromir and Lekshmi Narayanan, Arun Balajiee and Akhuseyinoglu, Kamil and Agarwal, Arav and Hendrawan, Rully Agus , booktitle=. Generating Effective Distractors for Introductory Programming Challenges:

  26. [26]

    Proceedings of the 18th Koli Calling International Conference on Computing Education Research , articleno =

    PCEX: Interactive Program Construction Examples for Learning Programming , authors =. Proceedings of the 18th Koli Calling International Conference on Computing Education Research , articleno =. 2018 , isbn =. doi:https://doi.org/10.1145/3279720.3279726 , abstract =

  27. [27]

    Chi, Michelene TH and Wylie, Ruth , journal=. The. 2014 , publisher=

  28. [28]

    Proceedings of the second (2015) ACM conference on learning@ scale , pages=

    Learning is not a spectator sport: Doing is better than watching for learning from a MOOC , author=. Proceedings of the second (2015) ACM conference on learning@ scale , pages=

  29. [29]

    International Journal of Artificial Intelligence in Education , volume=

    Improving engagement in program construction examples for learning Python programming , author=. International Journal of Artificial Intelligence in Education , volume=. 2020 , publisher=

  30. [30]

    Chi, Micheline T. H. and Bassok, Miriam and Lewis, Matthew W. and Reimann, Peter and Glaser, Robert , title =. Cognitive Science , volume =

  31. [31]

    Automated Assessment of Students’ Code Comprehension using

    Oli, Priti and Banjade, Rabin and Chapagain, Jeevan and Rus, Vasile , booktitle =. Automated Assessment of Students’ Code Comprehension using. 2024 , editor =

  32. [32]

    tutoring for enhancing code comprehension for novices , author=

    Exploring the effectiveness of reading vs. tutoring for enhancing code comprehension for novices , author=. Proceedings of the 39th ACM/SIGAPP symposium on applied computing , pages=

  33. [33]

    SEMILAR : The Semantic Similarity Toolkit

    Rus, Vasile and Lintean, Mihai and Banjade, Rajendra and Niraula, Nobal and Stefanescu, Dan. SEMILAR : The Semantic Similarity Toolkit. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2013

  34. [34]

    Sentence-BERT: Sentence Embeddings using Siamese

    Reimers, Nils and Gurevych, Iryna , booktitle=. Sentence-BERT: Sentence Embeddings using Siamese

  35. [35]

    The international

    Automated assessment of student self-explanation during source code comprehension , author=. The international

  36. [36]

    International Conference on Artificial Intelligence in Education , pages=

    A fairness evaluation of automated methods for scoring text evidence usage in writing , author=. International Conference on Artificial Intelligence in Education , pages=. 2021 , organization=

  37. [37]

    Proceedings of the

    Argument mining for improving the automated scoring of persuasive essays , author=. Proceedings of the

  38. [38]

    The many dimensions of algorithmic fairness in educational applications

    Loukina, Anastassia and Madnani, Nitin and Zechner, Klaus. The many dimensions of algorithmic fairness in educational applications. Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. 2019. doi:10.18653/v1/W19-4401

  39. [39]

    Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

    Propagating Large Language Models Programming Feedback , author=. Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

  40. [40]

    The Thirteenth International Conference on Learning Representations , year=

    GReaTer: Gradients Over Reasoning Makes Smaller Language Models Strong Prompt Optimizers , author=. The Thirteenth International Conference on Learning Representations , year=

  41. [41]

    Aaai , volume=

    Corpus-based and knowledge-based measures of text semantic similarity , author=. Aaai , volume=

  42. [42]

    8th Educational Data Mining in Computer Science Education (CSEDM) Workshop at EDM2024 , editor =

    Lekshmi-Narayanan, Arun-Balajiee and Brusilovsky, Peter , title =. 8th Educational Data Mining in Computer Science Education (CSEDM) Workshop at EDM2024 , editor =. 2024 , type =

  43. [43]

    Proceedings of Machine Learning Research , volume =

    Lekshmi-Narayanan, Arun-Balajiee and Oli, Priti and Chapagain, Jeevan and Hassany, Mohammad and Banjade, Rabin and Brusilovsky, Peter and Rus, Vasile , title =. Proceedings of Machine Learning Research , volume =. 2024 , type =

  44. [44]

    Advances in neural information processing systems , volume=

    Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

  45. [45]

    Proceedings of the 56th ACM Technical Symposium on Computer Science Education V

    Feasibility study of augmenting teaching assistants with ai for cs1 programming feedback , author=. Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1 , pages=

  46. [46]

    Direct Repair Optimization: Training Small Language Models For Educational Program Repair Improves Feedback

    Koutcheme, Charles and Dainese, Nicola and Hellas, Arto. Direct Repair Optimization: Training Small Language Models For Educational Program Repair Improves Feedback. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025). 2025. doi:10.18653/v1/2025.bea-1.41

  47. [47]

    Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation , year =

    Phung, Tung and P. Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation , year =. Proceedings of the 14th Learning Analytics and Knowledge Conference , pages =. doi:10.1145/3636555.3636846 , abstract =

  48. [48]

    , author=

    Generating High-Precision Feedback for Programming Syntax Errors Using Large Language Models. , author=. International Educational Data Mining Society , year=

  49. [49]

    NeurIPS’23 Workshop Generative AI for Education (GAIED)

    Improving the coverage of gpt for automated feedback on high school programming assignments , author=. NeurIPS’23 Workshop Generative AI for Education (GAIED). MIT Press, New Orleans, Louisiana, USA , volume=

  50. [50]

    Proceedings of the 17th International Conference on Educational Data Mining , pages=

    Generating Feedback-Ladders for Logical Errors in Programming using Large Language Models , author=. Proceedings of the 17th International Conference on Educational Data Mining , pages=

  51. [51]

    Text summarization branches out , pages=

    Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

  52. [52]

    arXiv preprint arXiv:2410.05193 , year=

    Reviseval: Improving llm-as-a-judge via response-adapted references , author=. arXiv preprint arXiv:2410.05193 , year=

  53. [53]

    Parameters driving effectiveness of automated essay scoring with

    Wild, Fridolin and Stahl, Christina and Stermsek, Gerald and Neumann, Gustaf , year=. Parameters driving effectiveness of automated essay scoring with

  54. [54]

    Behavior research methods, instruments, & computers , volume=

    Coh-Metrix: Analysis of text on cohesion and language , author=. Behavior research methods, instruments, & computers , volume=. 2004 , publisher=

  55. [55]

    Graesser and Peter Wiemer-Hastings and Katja Wiemer-Hastings and Derek Harter and Tutoring Research Group Tutoring Research Group and Natalie Person , title =

    Arthur C. Graesser and Peter Wiemer-Hastings and Katja Wiemer-Hastings and Derek Harter and Tutoring Research Group Tutoring Research Group and Natalie Person , title =. Interactive Learning Environments , volume =. 2000 , publisher =. doi:10.1076/1049-4820(200008)8:2;1-B;FT129 , URL =

  56. [56]

    Proceedings of the second workshop on Building Educational Applications Using

    Automatic essay grading with probabilistic latent semantic analysis , author=. Proceedings of the second workshop on Building Educational Applications Using

  57. [57]

    Advances in neural information processing systems , volume=

    Attention is all you need , author=. Advances in neural information processing systems , volume=

  58. [58]

    Review of educational research , volume=

    Learning from examples: Instructional principles from the worked examples research , author=. Review of educational research , volume=. 2000 , publisher=

  59. [59]

    Cognition and instruction , volume=

    The use of worked examples as a substitute for problem solving in learning algebra , author=. Cognition and instruction , volume=. 1985 , publisher=

  60. [60]

    Joint Conference on Digital Libraries, JCDL 2008 , pages =

    Brusilovsky, Peter and Hsiao, I-Han and Yudelson, Michael , title =. Joint Conference on Digital Libraries, JCDL 2008 , pages =. 2008 , type =

  61. [61]

    International Journal on the Man-Machine Studies , volume =

    Linn, Marcia , title =. International Journal on the Man-Machine Studies , volume =. 1992 , type =

  62. [62]

    1995 , type =

    Kelley, Al and Pohl, Ira , title =. 1995 , type =

  63. [63]

    and Deitel, Paul J

    Deitel, Harvey M. and Deitel, Paul J. , title =. 1994 , type =

  64. [64]

    Proceedings of the Fourth (2017) ACM Conference on Learningat Scale , publisher =

    Sharrock, Remi and Hamonic, Ella and Hiron, Mathias and Carlier, Sebastien , title =. Proceedings of the Fourth (2017) ACM Conference on Learningat Scale , publisher =. 2017 , type =. doi:10.1145/3051457.3053970 , url =

  65. [65]

    and Settle, Amber , booktitle =

    Vihavainen, Arto and Miller, Craig S. and Settle, Amber , booktitle =. Benefits of Self-explanation in Introductory Programming , --url =. doi:10.1145/2676723.2677260 , keywords =

  66. [66]

    Chi, Michelene T. H. and De Leeuw, Nicholas and Chiu, Mei-Hung and Lavancher, Christian , title =. Cognitive Science Volume 18, Issue 3, July–September, Pages , volume =. 1994 , type =

  67. [67]

    , title =

    Garces, Sebastian and Vieira, Camilo and Ravai, Guity and Magana, Alejandra J. , title =. Education and Information Technologies , volume =. 2023 , type =

  68. [68]

    and Brown, Ann L

    Bielaczyc, Katerine and Pirolli, Peter L. and Brown, Ann L. , title =. Cognition and Instruction , volume =. 1995 , type =

  69. [69]

    5th Educational Data Mining in Computer Science Education (CSEDM) Workshop at EDM2021 , publisher =

    Rus, Vasile and Akhuseyinoglu, Kamil and Chapagain, Jeevan and Tamang, Lasang and Brusilovsky, Peter , title =. 5th Educational Data Mining in Computer Science Education (CSEDM) Workshop at EDM2021 , publisher =. 2021 , type =

  70. [70]

    Automated Assessment of Students’ Code Comprehension using

    Oli, Priti and Banjade, Rabin and Chapagain, Jeevan and Rus, Vasile , booktitle=. Automated Assessment of Students’ Code Comprehension using. 2024 , organization=

  71. [71]

    The International FLAIRS Conference Proceedings , volume=

    SelfCode: An Annotated Corpus and a Model for Automated Assessment of Self-explanation during Source Code Comprehension , author=. The International FLAIRS Conference Proceedings , volume=

  72. [72]

    Proceedings of the 11th workshop on innovative use of

    Evaluation dataset (DT-Grade) and word weighting approach towards constructed short answers assessment in tutorial dialogue context , author=. Proceedings of the 11th workshop on innovative use of

  73. [73]

    AI magazine , volume=

    Recent advances in conversational intelligent tutoring systems , author=. AI magazine , volume=

  74. [74]

    arXiv preprint arXiv:2501.10365 , year=

    Can LLMs Identify Gaps and Misconceptions in Students' Code Explanations? , author=. arXiv preprint arXiv:2501.10365 , year=

  75. [75]

    arXiv preprint arXiv:2401.05399 , year=

    Automated Assessment of Students' Code Comprehension using LLMs , author=. arXiv preprint arXiv:2401.05399 , year=

  76. [76]

    Improving Code Comprehension Through Scaffolded Self-explanations , year =

    Oli, Priti and Banjade, Rabin and Lekshmi Narayanan, Arun Balajiee and Chapagain, Jeevan and Tamang, Lasang Jimba and Brusilovsky, Peter and Rus, Vasile , booktitle =. Improving Code Comprehension Through Scaffolded Self-explanations , year =

  77. [77]

    Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin , journal=

  78. [78]

    2025 , eprint=

    gpt-oss-120b & gpt-oss-20b Model Card , author=. 2025 , eprint=

  79. [79]

    Chain-of-Thought Prompting for Automated Evaluation of Revision Patterns in Young Student Writing

    Li, Tianwen and Hong, Michelle and Matsumura, Lindsay Clare and Wang, Elaine Lin and Litman, Diane and Liu, Zhexiong and Correnti, Richard. Chain-of-Thought Prompting for Automated Evaluation of Revision Patterns in Young Student Writing. Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress. 2025

  80. [80]

    The eleventh international conference on learning representations , year=

    Large language models are human-level prompt engineers , author=. The eleventh international conference on learning representations , year=

Showing first 80 references.