Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education

Arun-Balajiee Lekshmi-Narayanan; Mohammad Hassany; Peter Brusilovsky

arxiv: 2605.21614 · v1 · pith:VN3NTATYnew · submitted 2026-05-20 · 💻 cs.HC · cs.LG

Pith reviewed 2026-05-22 09:19 UTC · model grok-4.3

classification 💻 cs.HC cs.LG

keywords: LLM, self-explanations, automated assessment, programming education, semantic similarity, binary classification, worked examples, student responses

The pith

LLMs offer a more effective approach than semantic similarity for binary scoring of student self-explanations in programming.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large language models outperform semantic similarity methods when automatically judging the correctness of student self-explanations of programming steps. It frames this assessment as a binary classification task and stresses the need for datasets that feature balanced correct/incorrect examples along with domain-specific labels. A sympathetic reader would care because self-explanations strengthen learning from worked examples, yet manual review does not scale to large classes, making reliable automation essential for wider use of this teaching technique. The work tests if recent LLM advances have displaced older similarity-based scoring.

Core claim

The paper conducts a rigorous comparison of LLMs versus semantic similarity methods for automated scoring of student self-explanations, framed as a binary classification task that requires quality datasets with balanced class distributions and domain-specific labels to produce meaningful results.

What carries the argument

Binary classification of each student self-explanation as correct or incorrect, performed either by prompting an LLM or by measuring semantic similarity to an expert reference explanation.
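The similarity branch of this setup can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes a bag-of-words cosine similarity as a stand-in for whatever embedding model the authors use, and the function names and the `threshold` of 0.5 are hypothetical. The LLM branch is not shown because it depends on a specific model API.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count word tokens (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_similarity(student: str, expert: str, threshold: float = 0.5) -> bool:
    """Label the student explanation correct when it is close enough to the expert one.

    The threshold is an assumed, illustrative value; in practice it would be tuned
    on held-out labeled data.
    """
    return cosine_similarity(tokenize(student), tokenize(expert)) >= threshold
```

The design makes the weakness of similarity scoring visible: a correct explanation phrased with different vocabulary can fall below the threshold, while an incorrect one that reuses the expert's words can pass.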

If this is right

Automated scoring can support wider adoption of self-explanation activities inside large programming courses.
LLM-based feedback becomes feasible for real-time use during worked-example study.
Future evaluations of scoring techniques will require datasets that maintain balanced classes and domain-specific labels.
Self-explanation combined with worked examples can be deployed at greater scale without proportional increases in instructor effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar head-to-head tests could be run in other subjects such as mathematics or science to see whether LLMs retain their edge outside programming.
Hybrid scoring systems that combine LLM reasoning with similarity checks might improve robustness on edge cases.
Moving beyond binary labels to capture degrees of correctness or specific misconceptions could build on the same comparison method.

Load-bearing premise

Student self-explanations can be meaningfully assessed through a simple binary correct/incorrect classification, and suitable balanced domain-specific datasets exist for fair comparison.
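One way to check the balance requirement in this premise is sketched below. The helper names are hypothetical, and downsampling the majority class is shown only as a simple stand-in; Figure 3 indicates that the authors instead augment the dataset with generated negative examples.

```python
import random
from collections import Counter


def class_distribution(labels: list[bool]) -> dict[bool, float]:
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def downsample_to_balance(
    examples: list[tuple[str, bool]], seed: int = 0
) -> list[tuple[str, bool]]:
    """Keep an equal number of correct and incorrect examples.

    Each example is an (explanation_text, is_correct) pair. The seed makes the
    random selection reproducible.
    """
    rng = random.Random(seed)
    correct = [e for e in examples if e[1]]
    incorrect = [e for e in examples if not e[1]]
    n = min(len(correct), len(incorrect))
    balanced = rng.sample(correct, n) + rng.sample(incorrect, n)
    rng.shuffle(balanced)
    return balanced
```

Downsampling discards data, which is one reason a small corpus of student explanations might be augmented with synthetic negatives rather than trimmed.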

What would settle it

A new balanced dataset of student self-explanations where semantic similarity methods achieve higher classification accuracy than LLMs would show that the LLM approach is not more effective.
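Such a test comes down to scoring both methods against the same gold labels. The sketch below computes accuracy and F1 for one set of binary predictions; the function name is hypothetical, and any labels passed to it in an example would be illustrative rather than the paper's data.

```python
def binary_metrics(gold: list[bool], pred: list[bool]) -> dict[str, float]:
    """Compute accuracy, precision, recall and F1, treating True as the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Running `binary_metrics` once on the LLM predictions and once on the similarity predictions, over the same balanced test set, gives the two numbers the comparison turns on.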

Figures

Figures reproduced from arXiv: 2605.21614 by Arun-Balajiee Lekshmi-Narayanan, Mohammad Hassany, Peter Brusilovsky.

**Figure 2.** The AI-generated explanations remain similar.

**Figure 3.** Artificial generation to augment the original dataset with negative (incorrect) code examples.

Original abstract

Worked examples are step-by-step solutions to problems in a specific domain, offered to students to acquire domain-specific problem-solving skills. The effectiveness of worked examples could be enhanced by combining them with self-explanations, which ask students to explain rather than passively study each problem-solving step. The main challenge of this approach is assessing the correctness of the student's explanations. In the prevailing approach, student explanations are judged by their semantic similarity to an instructor's or domain expert's explanation. Given recent advances in LLM-based automated scoring, it remains unclear whether semantic similarity methods are still the most effective technique to automatically score textual student responses like essays or code explanations. Comparing these methods also requires quality datasets that offer distinctive features such as balanced class distributions and domain-specific labeled data for automated scoring tasks. In this paper, we present a rigorous comparison between LLMs and semantic similarity used for automated scoring, framed as a binary classification task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper compares LLMs to semantic similarity for scoring programming self-explanations as binary classification, but the label setup looks like the main weak point.

The main thing to know is that this paper sets up and runs a direct comparison between LLMs and semantic similarity methods for automatically scoring student self-explanations in programming worked examples, all framed as a binary correct/incorrect task. They note that semantic similarity has been the default but recent LLM advances might change that, and they stress the need for solid datasets with balanced classes and domain-specific labels.

Referee Report

2 major / 1 minor

Summary. The manuscript explores the effectiveness of large language models (LLMs) versus semantic similarity methods for automated assessment of student self-explanations in programming education. It frames the scoring task as binary classification (correct/incorrect) and notes the requirement for quality datasets featuring balanced class distributions and domain-specific labels to enable such comparisons.

Significance. If executed with appropriate datasets and yielding clear performance differences, the comparison could inform practical choices for scalable automated scoring in programming courses. The work targets a real bottleneck in worked-example-based instruction. However, the absence of reported results, metrics, or dataset descriptions in the provided text limits assessment of whether any concrete advance is demonstrated.

major comments (2)

[Abstract] The text describes the planned comparison and binary classification framing but supplies no actual results, performance numbers, dataset details, or prompting methods. This prevents evaluation of whether the data or methods support the central claim of a rigorous head-to-head comparison.
[Abstract, binary classification framing] Self-explanations in programming frequently contain partial correctness, multiple valid aspects, or subtle misconceptions that resist clean binary labeling. Without reported inter-rater reliability, multi-label alternatives, or correlation to learning outcomes, any performance delta between LLMs and semantic similarity could be an artifact of label granularity rather than genuine methodological superiority.
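The inter-rater reliability raised in the second comment is commonly reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a generic illustration; the referee does not name a statistic and nothing here is taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected if both raters labeled independently at
    their own base rates.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A low kappa on the correct/incorrect labels would support the referee's concern that the binary framing, rather than the scoring method, drives the measured difference.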

minor comments (1)

[Abstract] The abstract would benefit from a brief concrete example of a domain-specific labeled self-explanation to illustrate the balanced-class and domain-specific requirements mentioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript comparing LLMs and semantic similarity for automated assessment of student self-explanations in programming education. We address each major comment point by point below, indicating revisions where made to strengthen the paper.

Referee: The text describes the planned comparison and binary classification framing but supplies no actual results, performance numbers, dataset details, or prompting methods. This prevents evaluation of whether the data or methods support the central claim of a rigorous head-to-head comparison.

Authors: The full manuscript includes a dedicated dataset section describing the balanced class distributions and domain-specific labels collected from programming education self-explanations, a methods section detailing the LLM prompting strategies, and a results section reporting concrete performance metrics (accuracy, F1, etc.) from the comparison. To make these elements more immediately visible, we have revised the abstract to summarize the dataset characteristics, prompting approach, and primary performance findings.
revision: yes

Referee: Self-explanations in programming frequently contain partial correctness, multiple valid aspects, or subtle misconceptions that resist clean binary labeling. Without reported inter-rater reliability, multi-label alternatives, or correlation to learning outcomes, any performance delta between LLMs and semantic similarity could be an artifact of label granularity rather than genuine methodological superiority.

Authors: We acknowledge that binary labels cannot capture every nuance of student explanations. Our dataset labels were produced by domain experts using a protocol focused on whether the core conceptual element was correctly explained; we have now added the inter-rater reliability statistics and labeling guidelines to the methods section. We have also inserted a limitations paragraph discussing the binary framing, outlining multi-label alternatives for future work, and noting that direct correlation with learning outcomes lies outside the scope of this methodological comparison.
revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison without derivations or self-referential reductions

This is a standard empirical study that compares LLMs against semantic similarity baselines for binary classification of student self-explanations using external datasets and models. The paper contains no equations, parameter-fitting steps, uniqueness theorems, or derivation chains that could reduce to their own inputs by construction. All performance claims rest on reported experimental results against held-out data rather than any self-definitional or self-citation load-bearing structure, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that binary classification captures explanation quality and on the existence of suitable balanced, domain-specific labeled datasets; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Binary classification is sufficient to represent the correctness of student self-explanations.
The abstract frames the automated scoring task as binary classification without discussing the limitations of this reduction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "framed as a binary classification task... LLM approaches (F1 = 0.98, accuracy = 0.96) outperform semantic similarity methods (F1 = 0.72, accuracy = 0.65)"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and recovery
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "synthetically generated negative examples... 5-fold cross-validation on the balanced dataset"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page

