Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education
Pith reviewed 2026-05-22 09:19 UTC · model grok-4.3
The pith
LLMs offer a more effective approach than semantic similarity for binary scoring of student self-explanations in programming.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper conducts a rigorous comparison of LLMs versus semantic similarity methods for automated scoring of student self-explanations, framed as a binary classification task that requires quality datasets with balanced class distributions and domain-specific labels to produce meaningful results.
What carries the argument
Binary classification of each student self-explanation as correct or incorrect, performed either by prompting an LLM or by measuring semantic similarity to an expert reference explanation.
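The two scoring routes described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words cosine is a stand-in for whatever similarity measure the authors used, the 0.5 threshold is an arbitrary assumption, and `build_prompt`, `score_by_llm`, and the `llm` callable are hypothetical names introduced only for this example.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as sharing nothing
    return dot / (norm_a * norm_b)


def score_by_similarity(student: str, expert: str, threshold: float = 0.5) -> bool:
    """Label the explanation correct if it is close enough to the expert one.

    The threshold is an assumption; in practice it would be tuned on labeled data.
    """
    return cosine_similarity(student, expert) >= threshold


def build_prompt(code_line: str, student: str) -> str:
    """Hypothetical prompt asking a model for a one-word verdict."""
    return (
        "You are grading a student's explanation of one line of code.\n"
        f"Code line: {code_line}\n"
        f"Student explanation: {student}\n"
        "Answer with exactly one word: correct or incorrect."
    )


def score_by_llm(code_line: str, student: str, llm: Callable[[str], str]) -> bool:
    """Label the explanation using any callable that maps a prompt to a reply."""
    reply = llm(build_prompt(code_line, student))
    return reply.strip().lower().startswith("correct")
```

Passing the model as a plain callable keeps the sketch independent of any particular provider; a real client call would be wrapped in a function with the same shape.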
If this is right
- Automated scoring can support wider adoption of self-explanation activities inside large programming courses.
- LLM-based feedback becomes feasible for real-time use during worked-example study.
- Future evaluations of scoring techniques will require datasets that maintain balanced classes and domain-specific labels.
- Self-explanation combined with worked examples can be deployed at greater scale without proportional increases in instructor effort.
Where Pith is reading between the lines
- Similar head-to-head tests could be run in other subjects such as mathematics or science to see whether LLMs retain their edge outside programming.
- Hybrid scoring systems that combine LLM reasoning with similarity checks might improve robustness on edge cases.
- Moving beyond binary labels to capture degrees of correctness or specific misconceptions could build on the same comparison method.
Load-bearing premise
Student self-explanations can be meaningfully assessed through a simple binary correct/incorrect classification, and suitable balanced domain-specific datasets exist for fair comparison.
What would settle it
A new balanced dataset of student self-explanations where semantic similarity methods achieve higher classification accuracy than LLMs would show that the LLM approach is not more effective.
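A head-to-head test of this kind comes down to computing the same metrics for both methods on the same gold labels. The sketch below shows accuracy and F1 for binary labels using only the standard library; the function names are illustrative and the code is not taken from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where truthy means 'correct'."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif truth and not pred:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Running both functions on the predictions of each method against one shared label set gives the comparison the settling experiment calls for. On a balanced dataset, accuracy and F1 tend to tell a similar story; on a skewed one they can diverge sharply, which is one reason the balance requirement matters.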
Worked examples are step-by-step solutions to problems in a specific domain, offered to students to acquire domain-specific problem-solving skills. The effectiveness of worked examples could be enhanced by combining them with self-explanations, which ask students to explain rather than passively study each problem-solving step. The main challenge of this approach is assessing the correctness of the student's explanations. In the prevailing approach, student explanations are judged by their semantic similarity to an instructor's or domain expert's explanation. Given recent advances in LLM-based automated scoring, it remains unclear whether semantic similarity methods are still the most effective technique to automatically score textual student responses like essays or code explanations. Comparing these methods also requires quality datasets that offer distinctive features such as balanced class distributions and domain-specific labeled data for automated scoring tasks. In this paper, we present a rigorous comparison between LLMs and semantic similarity used for automated scoring, framed as a binary classification task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript explores the effectiveness of large language models (LLMs) versus semantic similarity methods for automated assessment of student self-explanations in programming education. It frames the scoring task as binary classification (correct/incorrect) and notes the requirement for quality datasets featuring balanced class distributions and domain-specific labels to enable such comparisons.
Significance. If executed with appropriate datasets and yielding clear performance differences, the comparison could inform practical choices for scalable automated scoring in programming courses. The work targets a real bottleneck in worked-example-based instruction. However, the absence of reported results, metrics, or dataset descriptions in the provided text limits assessment of whether any concrete advance is demonstrated.
major comments (2)
- [Abstract] The text describes the planned comparison and binary classification framing but supplies no actual results, performance numbers, dataset details, or prompting methods. This prevents evaluation of whether the data or methods support the central claim of a rigorous head-to-head comparison.
- [Abstract, binary classification framing] Self-explanations in programming frequently contain partial correctness, multiple valid aspects, or subtle misconceptions that resist clean binary labeling. Without reported inter-rater reliability, multi-label alternatives, or correlation to learning outcomes, any performance delta between LLMs and semantic similarity could be an artifact of label granularity rather than genuine methodological superiority.
minor comments (1)
- [Abstract] The abstract would benefit from a brief concrete example of a domain-specific labeled self-explanation to illustrate the balanced-class and domain-specific requirements mentioned.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript comparing LLMs and semantic similarity for automated assessment of student self-explanations in programming education. We address each major comment point by point below, indicating revisions where made to strengthen the paper.
- Referee: The text describes the planned comparison and binary classification framing but supplies no actual results, performance numbers, dataset details, or prompting methods. This prevents evaluation of whether the data or methods support the central claim of a rigorous head-to-head comparison.
  Authors: The full manuscript includes a dedicated dataset section describing the balanced class distributions and domain-specific labels collected from programming education self-explanations, a methods section detailing the LLM prompting strategies, and a results section reporting concrete performance metrics (accuracy, F1, etc.) from the comparison. To make these elements more immediately visible, we have revised the abstract to summarize the dataset characteristics, prompting approach, and primary performance findings.
  Revision: yes
- Referee: Self-explanations in programming frequently contain partial correctness, multiple valid aspects, or subtle misconceptions that resist clean binary labeling. Without reported inter-rater reliability, multi-label alternatives, or correlation to learning outcomes, any performance delta between LLMs and semantic similarity could be an artifact of label granularity rather than genuine methodological superiority.
  Authors: We acknowledge that binary labels cannot capture every nuance of student explanations. Our dataset labels were produced by domain experts using a protocol focused on whether the core conceptual element was correctly explained; we have now added the inter-rater reliability statistics and labeling guidelines to the methods section. We have also inserted a limitations paragraph discussing the binary framing, outlining multi-label alternatives for future work, and noting that direct correlation with learning outcomes lies outside the scope of this methodological comparison.
  Revision: partial
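The rebuttal does not say which inter-rater reliability statistic was added. As an illustration only, Cohen's kappa is one common choice for two annotators assigning binary labels; the sketch below assumes that setting and is not the authors' code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

A kappa near 1 indicates agreement well above chance, while a value near 0 means the two raters agree about as often as random labeling with the same label frequencies would.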
Circularity Check
No circularity: empirical comparison without derivations or self-referential reductions
This is a standard empirical study that compares LLMs against semantic similarity baselines for binary classification of student self-explanations using external datasets and models. The paper contains no equations, parameter-fitting steps, uniqueness theorems, or derivation chains that could reduce to their own inputs by construction. All performance claims rest on reported experimental results against held-out data rather than any self-definitional or self-citation load-bearing structure, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Binary classification is sufficient to represent correctness of student self-explanations
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "framed as a binary classification task... LLM approaches (F1 = 0.98, accuracy = 0.96) outperform semantic similarity methods (F1 = 0.72, accuracy = 0.65)"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat induction and recovery (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "synthetically generated negative examples... 5-fold cross-validation on the balanced dataset"
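The quoted passage mentions 5-fold cross-validation on a balanced dataset but does not describe how the folds were built. One plausible reading is a stratified split that keeps the class balance inside every fold; the sketch below shows that idea with the standard library only, and the function name, seed, and dealing scheme are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k stratified folds.

    Indices of each class are shuffled and dealt round-robin into the folds,
    so every fold receives a near-equal share of each class.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test
```

Each item lands in exactly one test fold, so averaging a metric over the k test folds uses every labeled explanation once for evaluation.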
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.