Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

Francisco J. Rodriguez-Martinez; Lorena Otero-Cerdeira; Manuel Alonso-Carracedo; Pedro Celard; Ruben Fernandez-Boullon

arXiv: 2607.02432 · v1 · pith:JVHLNIWNnew · submitted 2026-07-02 · 💻 cs.AI · cs.CL · cs.CY

Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard, Francisco J. Rodriguez-Martinez, Lorena Otero-Cerdeira

Pith reviewed 2026-07-03 13:13 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CY

keywords: automated grading · large language models · Linux bash · cognitive taxonomy · command-line examinations · AI-assisted assessment · rubric prompting · student response evaluation

The pith

Gemini 3.0 Pro with rubric-guided prompts matches expert grading on Linux and bash answers, but agreement falls as cognitive complexity rises.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether frontier large language models can grade short Linux and bash command responses at the level of human instructors. It defines a four-level taxonomy that sorts questions by cognitive complexity and operational impact, then tests four models on 1200 real student answers using both basic and rubric-enhanced prompts. The strongest result shows high agreement with expert grades, yet performance drops steadily at higher taxonomy levels. Rubric structure matters more than model choice. The work supplies a practical way to decide which exam items can use AI grading and which still need human review.
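To make the two prompt conditions concrete, the sketch below builds a minimal prompt and a rubric-enhanced prompt for the same answer. The wording, the 0 to 1 score scale, the function names, and the sample rubric are all hypothetical illustrations; the paper's actual templates are not reproduced here.

```python
from typing import Sequence


def build_minimal_prompt(question: str, answer: str) -> str:
    """Baseline variant (hypothetical wording): question and answer only."""
    return (
        "Grade the following Linux/bash exam answer on a scale from 0 to 1.\n"
        f"Question: {question}\n"
        f"Student answer: {answer}\n"
        "Reply with the numeric score only."
    )


def build_rubric_prompt(question: str, answer: str, rubric: Sequence[str]) -> str:
    """Rubric-enhanced variant (hypothetical wording): adds explicit criteria."""
    criteria = "\n".join(f"- {criterion}" for criterion in rubric)
    return (
        "Grade the following Linux/bash exam answer on a scale from 0 to 1.\n"
        f"Question: {question}\n"
        f"Student answer: {answer}\n"
        "Rubric (award partial credit per criterion, accept equivalent commands):\n"
        f"{criteria}\n"
        "Reply with the numeric score only."
    )


# Made-up example item, not taken from the study's exam.
question = "List all files in the current directory, including hidden ones."
answer = "ls -la"
rubric = ["Uses ls", "Includes -a or -A so hidden files are shown"]

print(build_minimal_prompt(question, answer))
print()
print(build_rubric_prompt(question, answer, rubric))
```

The only difference between the two variants is the rubric block, which mirrors the comparison the summary describes: same model, same answer, with and without structured grading criteria.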

Core claim

Gemini 3.0 Pro with rubric-guided prompting reached the highest agreement with three human graders: ICC(3,1) = 0.888, MAE = 0.10, and a near-zero Bland-Altman bias of -0.014. Agreement declined consistently with taxonomy level, with the largest gaps at L3 and L4. Across all four models, rubric quality produced larger gains than provider differences, and structured prompts improved results uniformly. Question complexity therefore serves as a reliable predictor of LLM grading difficulty and supplies a basis for deciding which items are suitable for AI assistance.
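For readers unfamiliar with the headline statistic, here is a minimal sketch of ICC(3,1) as defined by Shrout and Fleiss: a two-way mixed model, consistency, single rater, computed as (BMS - EMS) / (BMS + (k - 1) EMS). The data layout (one row per response, one column per rater) and the toy scores are assumptions; the source does not say how the LLM score was paired with the three human scores.

```python
import numpy as np


def icc_3_1(scores) -> float:
    """ICC(3,1) for a matrix of shape (n_items, k_raters).

    BMS is the between-items mean square and EMS the residual mean square
    of a two-way ANOVA without replication.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * np.sum((scores.mean(axis=1) - grand) ** 2)  # between items
    ss_cols = n * np.sum((scores.mean(axis=0) - grand) ** 2)  # between raters
    ss_total = np.sum((scores - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return float((bms - ems) / (bms + (k - 1) * ems))


# Made-up scores: column 0 = human reference, column 1 = LLM (assumed pairing).
toy = [[1.0, 1.0], [0.5, 0.75], [0.0, 0.0], [0.75, 0.5], [0.25, 0.25]]
print(round(icc_3_1(toy), 3))
```

Because ICC(3,1) measures consistency, a grader that is uniformly stricter by a constant offset still scores 1.0; the Bland-Altman bias is the complementary check for such systematic shifts.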

What carries the argument

A four-level cognitive taxonomy that orders questions by increasing cognitive complexity and operational impact, from L1 information retrieval through L4 advanced system management.

If this is right

Rubric-enhanced prompts raise agreement more reliably than switching between frontier models.
L3 and L4 questions show the largest gaps and therefore require human review even when lower levels can use AI.
The taxonomy supplies a concrete rule for routing items to AI or human graders.
The reported prompt templates and evaluation protocol transfer directly to other command-line or scripting assessments.
Question complexity measured by the taxonomy predicts LLM grading error across providers.
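The routing rule implied by the list above can be sketched in a few lines. The class, the function, and the level threshold are hypothetical; the only part taken from the text is that L3 and L4 items go to human review while lower levels can use AI assistance.

```python
from dataclasses import dataclass

# Assumed policy derived from the summary: L3 and L4 need human review.
HUMAN_REVIEW_LEVELS = frozenset({3, 4})


@dataclass(frozen=True)
class ExamItem:
    item_id: str
    taxonomy_level: int  # 1 (information retrieval) .. 4 (advanced system management)


def route(item: ExamItem) -> str:
    """Return 'human' or 'ai_assisted' for an exam item."""
    if item.taxonomy_level not in (1, 2, 3, 4):
        raise ValueError(f"unknown taxonomy level: {item.taxonomy_level}")
    return "human" if item.taxonomy_level in HUMAN_REVIEW_LEVELS else "ai_assisted"


# Made-up items for illustration.
for item in (ExamItem("q1", 1), ExamItem("q7", 3)):
    print(item.item_id, route(item))
```

Keeping the threshold in one named constant makes it easy to tighten or relax the policy if per-level agreement figures change.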

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Instructors could pre-screen new exam questions with the taxonomy and automatically route higher levels to humans.
The same taxonomy might help design training data that improves model performance specifically on L3 and L4 items.
If the pattern holds, departments facing enrollment growth could scale grading capacity without uniform loss of quality.
The approach offers a template for testing AI graders in any domain where partial credit and multiple correct answers are common.

Load-bearing premise

The four-level taxonomy correctly identifies the features of questions that determine how difficult they are for LLMs to grade.

What would settle it

A replication on the same responses in which a model or prompt achieves high agreement on L4 items, or in which agreement no longer declines with taxonomy level, would undermine the claim that complexity predicts grading difficulty.
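A replication check of this kind could be scripted roughly as follows: compute the mean absolute error per taxonomy level and test whether it rises strictly from one level to the next. The record layout, the function names, and the toy scores are assumptions for illustration, not the paper's protocol.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mae_by_level(records: Iterable[Tuple[int, float, float]]) -> Dict[int, float]:
    """records holds (taxonomy_level, ai_score, human_score) triples."""
    errors = defaultdict(list)
    for level, ai_score, human_score in records:
        errors[level].append(abs(ai_score - human_score))
    return {level: sum(vals) / len(vals) for level, vals in sorted(errors.items())}


def error_rises_with_level(per_level: Dict[int, float]) -> bool:
    """True if MAE strictly increases from each level to the next."""
    ordered = [per_level[level] for level in sorted(per_level)]
    return all(a < b for a, b in zip(ordered, ordered[1:]))


# Made-up data: two responses per level.
toy = [
    (1, 0.5, 0.5), (1, 1.0, 1.0),
    (2, 0.5, 0.75), (2, 1.0, 1.0),
    (3, 0.0, 0.5), (3, 1.0, 1.0),
    (4, 0.0, 1.0), (4, 0.5, 0.5),
]
per_level = mae_by_level(toy)
print(per_level, error_rises_with_level(per_level))
```

If a new model or prompt made the second function return False on the real data, or pushed the L4 error down to L1 levels, that would count against the claim.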

Figures

Figures reproduced from arXiv: 2607.02432 by Francisco J. Rodriguez-Martinez, Lorena Otero-Cerdeira, Manuel Alonso-Carracedo, Pedro Celard, Ruben Fernandez-Boullon.

**Figure 1.** Research methodology. …adequacy, efficiency, and conceptual understanding. The human evaluators assign numerical scores to each response, providing a baseline for comparison. Using three evaluators enables the calculation of inter-rater reliability metrics, which serve to establish the consistency and objectivity of the human grading process before comparing against AI-based…

**Figure 2.** Taxonomy used to classify commands by cognitive complexity and operational…

**Figure 3.** Frequency distribution of normalized item-level scores assigned by human evaluators.

**Figure 4.** Distribution of assigned grade per taxonomy level for human evaluators.

**Figure 5.** Distribution of assigned grade per question for Variant 1 and Variant 2 across LLM models.

**Figure 6.** Distribution of assigned grade per taxonomy level for Variant 1 and Variant 2 across LLM models.

Original abstract

Scalable and reliable grading of command-line examinations remains a challenge in computing education, where rising enrolments make manual marking difficult and rule-based autograders cannot handle partial credit, equivalent solutions, or syntactic variation. This paper evaluates whether four frontier Large Language Models (GPT, Claude Opus, Gemini, and GLM) can approximate expert judgment when grading short Linux/bash command responses. The study adopts a four-level cognitive taxonomy that combines cognitive complexity and operational impact, ranging from information retrieval (L1) and basic file manipulation (L2) to structural operations (L3) and advanced system management (L4). The models were tested with two prompt variants, a minimal baseline and a rubric-enhanced version, on 1200 real responses from second-year Computer Engineering students independently graded by three expert instructors. Gemini 3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10, Bland-Altman bias = -0.014). Agreement declined consistently as taxonomy level increased, with the largest discrepancies at higher levels. Across all models, rubric quality had a larger effect than provider choice, with structured prompts consistently improving agreement. These results show that question complexity is a reliable predictor of the difficulty LLMs face in grading accurately, and they establish a principled, taxonomy-based framework for determining which questions are suitable for AI-assisted grading and which require human review, while also providing a transferable evaluation protocol and prompt templates.
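The two simpler metrics in the abstract, MAE and Bland-Altman bias, can be computed as in the sketch below. The sign convention (AI minus human, so a negative bias means the model grades lower) and the 1.96 standard deviation limits of agreement are assumptions; the abstract does not state which convention the authors used. The scores are made up.

```python
import numpy as np


def mean_absolute_error(ai, human) -> float:
    """Average absolute gap between AI and human scores."""
    ai, human = np.asarray(ai, dtype=float), np.asarray(human, dtype=float)
    return float(np.mean(np.abs(ai - human)))


def bland_altman(ai, human):
    """Return (bias, lower_limit, upper_limit) of agreement.

    bias is the mean of (ai - human); the limits are bias +/- 1.96 sample
    standard deviations of the differences.
    """
    diff = np.asarray(ai, dtype=float) - np.asarray(human, dtype=float)
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up normalized scores in [0, 1].
ai_scores = [0.5, 1.0, 0.0, 0.75]
human_scores = [0.5, 0.75, 0.25, 0.75]
print(mean_absolute_error(ai_scores, human_scores))
print(bland_altman(ai_scores, human_scores))
```

MAE reports the typical size of a disagreement, while the bias reports its direction; a model can have a bias near zero and still a large MAE if it over- and under-grades in equal measure.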

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a practical head-to-head on four LLMs grading 1200 real bash responses with a new taxonomy and shows rubric prompts help, but the drop in agreement at higher levels may track response length instead of the intended cognitive factors.

The core result here is that Gemini 3.0 Pro with rubric prompts reaches ICC 0.888 and low MAE against three human graders on 1200 student Linux/bash answers, while agreement falls steadily from L1 to L4 and prompt structure beats model choice. They ran four frontier models on the same real exam data with two prompt variants and reported standard agreement stats plus Bland-Altman.

What the work actually adds is the four-level taxonomy (information retrieval up to advanced system management) applied systematically to decide which questions are safe for AI grading. The evaluation protocol, the multi-model comparison, and the concrete numbers on real second-year student responses are more structured than most LLM-grading pilots.

The main soft spot is the one the stress-test flagged. Higher taxonomy levels involve more complex commands that naturally produce longer or more variable answers. Without any reported control for token count, number of commands, or response length, the observed decline in agreement could be driven by surface features rather than the cognitive/operational dimensions the taxonomy claims to isolate. The abstract treats the taxonomy as the predictor but does not show the necessary stratification or regression to back that up.

This is useful for computing educators who need to know which exam questions can safely use AI assistance and which still need humans. It is not going to change general AI evaluation practice. The data and methods look solid enough on the surface to warrant referee time, though the confounding issue will need addressing in revision.

Referee Report

2 major / 2 minor

Summary. The paper evaluates four frontier LLMs (GPT, Claude Opus, Gemini, GLM) on grading 1200 real Linux/bash student responses independently marked by three expert instructors. It introduces a four-level cognitive taxonomy (L1 information retrieval to L4 advanced system management) combining cognitive complexity and operational impact, tests minimal vs. rubric-enhanced prompts, and reports that Gemini 3.0 Pro with rubric prompting yields the highest human-AI agreement (ICC(3,1)=0.888, MAE=0.10, Bland-Altman bias=-0.014), with agreement declining monotonically across taxonomy levels. The central claim is that the taxonomy reliably predicts LLM grading difficulty and supplies a transferable framework for deciding which questions suit AI-assisted grading versus human review.

Significance. If the taxonomy effect survives controls for surface features, the work supplies a large-scale empirical benchmark (1200 responses, multiple models, two prompt conditions, ICC/MAE/Bland-Altman metrics) for AI grading in computing education and a concrete, falsifiable protocol for selecting questions for automation. The explicit comparison against independent human grades and the provision of prompt templates are strengths that could be directly reused.

major comments (2)

Results section (agreement by taxonomy level): The monotonic decline in agreement from L1 to L4 is presented as evidence that the four-level taxonomy predicts LLM grading difficulty. However, the manuscript reports neither mean response length, token count, nor number of distinct commands per level, nor does it include any regression or stratification that controls for these variables. Consequently the taxonomy effect cannot be isolated from the possibility that higher-level questions simply elicit longer or more variable student answers.
Methods (taxonomy construction and question assignment): The taxonomy is defined by combining cognitive complexity and operational impact, yet the paper provides no inter-rater reliability statistic for how the exam questions behind the 1200 responses were classified into L1–L4, nor any validation that the four levels are orthogonal to response length. This classification step is load-bearing for the claim that taxonomy level, rather than surface features, drives the observed grading discrepancies.
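The control requested in the first major comment could look like the following sketch: an ordinary least-squares fit of per-response absolute grading error on taxonomy level and response length together. The variable names and the synthetic data are assumptions; in the toy data the error depends on length alone, so the fitted level coefficient comes out at essentially zero.

```python
import numpy as np


def fit_error_model(levels, lengths, abs_errors) -> dict:
    """OLS fit of abs_error ~ intercept + level + length.

    Returns the three fitted coefficients. A level coefficient that stays
    clearly positive with length in the model would support the taxonomy
    effect; one that collapses toward zero would point to a length confound.
    """
    design = np.column_stack(
        [np.ones(len(levels)), np.asarray(levels), np.asarray(lengths)]
    ).astype(float)
    target = np.asarray(abs_errors, dtype=float)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return {"intercept": float(coef[0]), "level": float(coef[1]), "length": float(coef[2])}


# Synthetic data: error is driven by response length only (0.01 per token).
levels = [1, 1, 2, 2, 3, 3, 4, 4]
lengths = [10, 30, 20, 40, 15, 35, 25, 45]
abs_errors = [0.01 * n for n in lengths]
print(fit_error_model(levels, lengths, abs_errors))
```

A fuller analysis would add confidence intervals and possibly treat level as categorical, but even this two-predictor fit separates the two explanations the referee contrasts.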

minor comments (2)

Abstract: The model identifier 'Gemini 3.0 Pro' should be accompanied by the precise release date or API version used, to allow exact replication.
Methods (notation): In the Shrout and Fleiss convention, ICC(3,1) denotes a two-way mixed model with raters treated as fixed effects and a single-rater consistency estimate. The paper uses the label without stating this, so a brief clarification in the methods would improve interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional controls and methodological detail would strengthen the manuscript. We address each point below and outline the revisions we will make.

Referee: Results section (agreement by taxonomy level): The monotonic decline in agreement from L1 to L4 is presented as evidence that the four-level taxonomy predicts LLM grading difficulty. However, the manuscript reports neither mean response length, token count, nor number of distinct commands per level, nor does it include any regression or stratification that controls for these variables. Consequently the taxonomy effect cannot be isolated from the possibility that higher-level questions simply elicit longer or more variable student answers.

Authors: We agree that the absence of these controls is a limitation. The manuscript does not report mean response lengths, token counts, or command counts per level, nor any regression that isolates taxonomy level from surface features. We will add these descriptive statistics to the Results section and conduct a regression analysis with taxonomy level and response length (plus token count) as predictors of agreement metrics. This will allow us to test whether taxonomy level remains predictive after controlling for length. revision: yes

Referee: Methods (taxonomy construction and question assignment): The taxonomy is defined by combining cognitive complexity and operational impact, yet the paper provides no inter-rater reliability statistic for how the exam questions behind the 1200 responses were classified into L1–L4, nor any validation that the four levels are orthogonal to response length. This classification step is load-bearing for the claim that taxonomy level, rather than surface features, drives the observed grading discrepancies.

Authors: The exam questions behind the 1200 responses were classified into L1–L4 by the same three expert instructors who performed the independent grading, using the taxonomy definitions in the Methods. No inter-rater reliability was computed for the classification step. We will expand the Methods to describe the classification procedure in detail. We will also report the correlation between taxonomy level and response length as a check on orthogonality. If the original classification records permit, we will compute and report IRR; otherwise we will note this as a limitation. revision: partial
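If the classification records do exist, one common way to report IRR for three raters assigning items to categories is Fleiss' kappa, sketched below. This is a suggestion, not the authors' stated plan: the choice of statistic, the function name, and the toy labels are assumptions. Fleiss' kappa treats L1 to L4 as unordered categories, so a weighted or ordinal statistic may suit the taxonomy better.

```python
import numpy as np


def fleiss_kappa(labels, categories=(1, 2, 3, 4)) -> float:
    """Fleiss' kappa for labels of shape (n_items, n_raters).

    Every item must be rated by the same number of raters.
    """
    labels = np.asarray(labels)
    n_items, n_raters = labels.shape
    # counts[i, j] = number of raters who put item i in category j
    counts = np.stack([(labels == c).sum(axis=1) for c in categories], axis=1)
    p_item = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_observed = p_item.mean()
    p_category = counts.sum(axis=0) / (n_items * n_raters)
    p_expected = np.sum(p_category ** 2)
    return float((p_observed - p_expected) / (1.0 - p_expected))


# Made-up level assignments from three raters for four questions.
toy_labels = [[1, 1, 2], [2, 2, 2], [3, 3, 4], [4, 4, 4]]
print(round(fleiss_kappa(toy_labels), 3))
```

Note that the statistic is undefined when every rater uses a single category for all items, because expected agreement is then 1.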

Circularity Check

0 steps flagged

No circularity; direct empirical agreement metrics on independent human grades

The paper reports ICC, MAE, and Bland-Altman statistics comparing LLM grades directly to three independent expert human grades on 1200 student responses. The four-level taxonomy is adopted upfront from cognitive complexity and operational impact criteria and is not derived from or fitted to the observed agreement values. No equations, fitted parameters, or self-citations are used to define the taxonomy levels or to turn measured agreement into a prediction. The decline in agreement across levels is an empirical observation, not a definitional or self-referential step. This is a standard external validation design with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the validity of the newly proposed taxonomy and the assumption that expert human grades constitute an unbiased gold standard. No mathematical free parameters are fitted; the taxonomy levels are constructed for this study.

axioms (1)

domain assumption: Inter-rater reliability statistics such as ICC(3,1) and Bland-Altman analysis are appropriate measures for comparing LLM and human grading.
Standard practice in educational assessment research, invoked to interpret the reported agreement numbers.

invented entities (1)

Four-level cognitive taxonomy (L1 information retrieval to L4 advanced system management): no independent evidence
purpose: to classify exam questions by increasing complexity and thereby predict where LLMs will have greater grading difficulty.
The taxonomy is defined and applied within this paper without reference to prior independent validation or external datasets.


[1] [1]

Stepgrade: Grading programming as- signmentswithcontext-awarellms, in: 2025IEEEIntegratedSTEMEducationConference (ISEC), pp. 1–6. doi:10.1109/ISEC64801.2025.11147374. Alonso-Carracedo, M., Fernandez-Boullon, R., Celard, P., Rodriguez-Martinez, F.J., Otero- Cerdeira, L.,

work page doi:10.1109/isec64801.2025.11147374 2025

[2] [2]

URL:https://arxiv.org/abs/2607.00140,arXiv:2607.00140

Cogtax: A four-level cognitive taxonomy for command-line computing education. URL:https://arxiv.org/abs/2607.00140,arXiv:2607.00140. Anthropic,

work page arXiv

[3] [3]

URL:https://platform.claude.com/docs/ en/docs/about-claude/models

Claude api models overview. URL:https://platform.claude.com/docs/ en/docs/about-claude/models. accessed 2026-02-19. Arshad, M., Yasmeen, M., Iqbal, A.N., Akhtar, N.,

2026

[4] [4]

Bolgova, O., Ganguly, P., Ikram, M.F., Mavrych, V.,

doi:10.3390/app151810055. Bolgova, O., Ganguly, P., Ikram, M.F., Mavrych, V.,

work page doi:10.3390/app151810055

[5] [5]

Medical Education Online 30, 2550751

Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders. Medical Education Online 30, 2550751. doi:10.1080/10872981.2025.2550751. Cohen, J.,

work page doi:10.1080/10872981.2025.2550751 2025

[6] [6]

Educational and Psychological Measurement 20, 37–46

A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37–46. doi:10.1177/001316446002000104. Cohen, J.,

work page doi:10.1177/001316446002000104

[7] [7]

Psychological Bulletin 70, 213–220

Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70, 213–220. doi:10.1037/h0026256. Dorodchi, M., Dehbozorgi, N., Frevert, T.K.,

work page doi:10.1037/h0026256

[8] [8]

Emirtekin, E., Özarslan, Y.,

doi:10.1109/FIE.2017.8190523. Emirtekin, E., Özarslan, Y.,

work page doi:10.1109/fie.2017.8190523 2017

[9] [9]

Journal of Computer Assisted Learning 42, e70160

Automatic short-answer grading in sustainability ed- ucation: Ai–human agreement. Journal of Computer Assisted Learning 42, e70160. doi:https://doi.org/10.1002/jcal.70160. 29 Eurostat,

work page doi:10.1002/jcal.70160

[10] [10]

URL:https://ec.europa

ICT education - a statistical overview. URL:https://ec.europa. eu/eurostat/statistics-explained/index.php?title=ICT_education_-_a_ statistical_overview. accessed: 2026-05-13. Freedman, D., Pisani, R., Purves, R.,

2026

[11] [11]

Assessing the reliability and validity of large language models for automated assessment of student essays in higher education.arXiv:2508.02442. Google,

work page arXiv

[12] [12]

URL:https://ai

Gemini 3 pro preview model documentation (gemini api). URL:https://ai. google.dev/gemini-api/docs/models/gemini-3-pro-preview. accessed 2026-02-19. Grandel, S., Schmidt, D.C., Leach, K.,

2026

[13] [13]

Applying large language models to enhance the assessment of parallel functional programming assignments, in: Proceedings of the 1st International Workshop on Large Language Models for Code, Association for Computing Machinery, New York, NY, USA. p. 102–110. doi:10.1145/3643795.3648375. Jacobs, S., Jaschke, S.,

work page doi:10.1145/3643795.3648375

[14] [14]

Evaluating the application of large language models to gen- erate feedback in programming education, in: 2024 IEEE Global Engineering Education Conference (EDUCON), pp. 1–5. doi:10.1109/EDUCON60312.2024.10578838. Johnson, G.K., Gaspar, A., Boyer, N., Bennett, C., Armitage, W.D.,

work page doi:10.1109/educon60312.2024.10578838 2024

[15] [15]

Machanick, P.,

Investigating automatic scoring and feedback using large language models.arXiv:2405.00602. Machanick, P.,

work page arXiv

[16] [16]

The Lancet 327, 307–310

Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet 327, 307–310. doi:https://doi.org/ 10.1016/S0140-6736(86)90837-8. Mohamed, K., Yousef, M., Medhat, W., Mohamed, E.H., Khoriba, G., Arafa, T.,

work page doi:10.1016/s0140-6736(86)90837-8

[17] [17]

Information Systems 128, 102473

Hands-on analysis of using large language models for the auto evaluation of programming assignments. Information Systems 128, 102473. doi:https://doi.org/10.1016/j.is. 2024.102473. OpenAI,

work page doi:10.1016/j.is 2024

[18] [18]

URL:https:// developers.openai.com/api/docs/models/gpt-5.2

GPT-5.2 model documentation (openai developers). URL:https:// developers.openai.com/api/docs/models/gpt-5.2. accessed 2026-02-19. Pack, A., Barrett, A., Escalante, J.,

2026

[19] [19]

Ad- vanced Engineering Informatics65, 103345 (2025).https://doi.org/10.1016/j

Large language models and automated essay scoring of english language learner writing: Insights into validity and reliability. Comput- ers and Education: Artificial Intelligence 6, 100234. doi:https://doi.org/10.1016/j. caeai.2024.100234. 30 Pathak, A., Gandhi, R., Uttam, V., Ramamoorthy, A., Ghosh, P., Jindal, A.R., Verma, S., Mittal, A., Ased, A., Khatr...

work page doi:10.1016/j 2024

[20] Rubric is all you need: Improving LLM-based code evaluation with question-specific rubrics, in: Proceedings of the 2025 ACM Conference on International Computing Education Research V.1, Association for Computing Machinery, New York, NY, USA. pp. 181–195. doi:10.1145/3702652.3744220. Pecuchova, J., Benko, Ľ., Drlik, M.,

[21] Automated grading of open-ended questions in higher education using GenAI models. International Journal of Artificial Intelligence in Education 35, 3813–3846. doi:10.1007/s40593-025-00517-2. Pereira, A.F., Ferreira Mello, R.,

[22] A systematic literature review on large language models applications in computer programming teaching evaluation process. IEEE Access 13, 113449–113460. doi:10.1109/ACCESS.2025.3584060. Pillet, J.C., Larsen, K.R., Dobolyi, D., Queiroz, M., Handler, A., Arnulf, J.K., Sharma, R.,

[23] AI-augmented content validation in behavioral research: Development and evaluation of the rater system. Management Information Systems Quarterly, 1–33. doi:10.25300/MISQ/2025/18946. Qiu, Y., Azimi, S.M., Lensky, A.,

[24] Mark my works autograder for programming courses. arXiv:2601.10093. Quan, J.C., Luo, C., Yang, F., Qiu, H.P.,

[25] Bloom’s taxonomy of educational objectives in information system courses. DEStech Transactions on Social Science, Education and Human Science. doi:10.12783/DTSSEHS/MESS2016/9623. Raihan, N., Siddiq, M.L., Santos, J.C., Zampieri, M.,

[26] Large language models in computer science education: A systematic literature review, in: Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1, Association for Computing Machinery, New York, NY, USA. pp. 938–944. doi:10.1145/3641554.3701863. Shrout, P.E., Fleiss, J.L.,

[27] Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 86, 420–428. doi:10.1037/0033-2909.86.2.420. Siek, K., Batten, J.,

[28] Taulbee survey 2024: Annual report. URL: https://datavisualization.cra.org/TaulbeeSurvey/CRA_Taulbee_Survey_Report_2024.html. Accessed 2026-04-21. Spearman, C.,

[29] The proof and measurement of association between two things. The American Journal of Psychology 15, 72–101. doi:10.2307/1412159. Thomas, M.L., Yildirim-Erbasli, S.N., Hariharan, S.,

[30] Exploring undergraduate students’ perceptions of AI vs. human scoring and feedback. The Internet and Higher Education 68, 101052. doi:10.1016/j.iheduc.2025.101052. Tseng, E.Q., Huang, P.C., Hsu, C., Wu, P.Y., Ku, C.T., Kang, Y.,

[31] CodEv: An Automated Grading Framework Leveraging Large Language Models for Consistent and Constructive Feedback, in: 2024 IEEE International Conference on Big Data (BigData), IEEE Computer Society, Los Alamitos, CA, USA. pp. 5442–5449. doi:10.1109/BigData62323.2024.10825949. Wang, S., Xu, T., Li, H., Zhang, C., Liang, J., Tang, J., Yu, P.S., Wen, Q.,

[32] Evaluating large language models as raters in large-scale writing assessments: A psychometric framework for reliability and validity. Computers and Education: Artificial Intelligence 9, 100481. doi:10.1016/j.caeai.2025.100481. Wangwiwattana, C., Tongvivat, Y.,

[33] Automating academic assessment: A large language model approach, in: 2023 7th International Conference on Information Technology (InCIT), pp. 330–334. doi:10.1109/InCIT60207.2023.10412991. Yousef, M., Mohamed, K., Medhat, W., Mohamed, E.H., Khoriba, G., Arafa, T.,

[34] BeGrading: large language models for enhanced feedback in programming education. Neural Computing and Applications 37, 1027–1040. doi:10.1007/s00521-024-10449-y. Zhipu AI,

[35] Zhipu AI open platform model overview. URL: https://bigmodel.cn/dev/howuse/model. Accessed 2026-02-19.