LLMs as Teaching Assistants for Mathematics Exam Grading: Reliability, and Practical Usability

Aastha Sapkota; M. G. Sarwar Murshed

arxiv: 2607.01247 · v1 · pith:MQM3F6HTnew · submitted 2026-06-01 · 💻 cs.CY · cs.AI

Pith reviewed 2026-07-04 00:40 UTC · model grok-4.3

keywords LLM grading · mathematics exams · partial credit · grading reliability · discrete mathematics · AI teaching assistants · exam assessment · prompt engineering

The pith

Liberal partial-credit prompts reduce average question-level grading error for all six LLM configurations tested on a discrete mathematics exam.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates six current LLMs as assistants for grading an undergraduate discrete mathematics exam that requires proof and reasoning. It introduces a LIBERAL prompting policy that awards partial credit more generously for valid steps, contrasting it with a stricter BASELINE policy that demands complete explicit justification. Across all model families the liberal policy lowers average error at the individual question level when measured against human grades. The lowest question-level errors appear with ChatGPT 5.5 Thinking under the liberal policy, while the lowest total-score error and the highest rank-order correlation come from Gemini 3.1 Pro Extended under the liberal and baseline policies, respectively. This approach could let instructors handle larger classes while still giving feedback on open-ended work.

Core claim

The central claim is that adopting a liberal partial-credit prompting policy reduces average question-level error relative to a baseline strict-rubric policy for every one of the six LLM configurations evaluated. ChatGPT 5.5 Thinking under the liberal policy records the lowest question-level MAE of 1.87 and RMSE of 2.53; Gemini 3.1 Pro Extended under the liberal policy records the lowest total-score MAE of 8.00 and RMSE of 10.66. The highest total-score Pearson correlation of 0.58 occurs under the baseline policy for Gemini 3.1 Pro Extended, showing that minimizing absolute deviation and preserving student rank order are distinct objectives.
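The gap between absolute accuracy and rank preservation can be illustrated with a minimal Python sketch. The scores below are invented for illustration and are not taken from the paper, and `mae` and `pearson` are hypothetical helper names. A grader that is uniformly harsh keeps perfect rank order while carrying a large absolute error; a grader that is closer on average can still swap the order of neighboring students.

```python
import math

def mae(pred, truth):
    """Mean absolute error between two equal-length score lists."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical total scores for five students (not data from the paper).
human = [60, 70, 80, 90, 100]
harsh_grader = [h - 15 for h in human]   # uniformly 15 points low
noisy_grader = [66, 64, 86, 84, 100]     # closer on average, neighbors swapped

for name, scores in [("harsh", harsh_grader), ("noisy", noisy_grader)]:
    print(name, "MAE:", mae(scores, human),
          "Pearson:", round(pearson(scores, human), 3))
```

In this toy case the harsh grader has the larger MAE but the higher correlation, which is the same kind of split the paper reports between its LIBERAL and BASELINE winners.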

What carries the argument

The LIBERAL versus BASELINE prompting policies, where the liberal version relaxes demands for complete explicit evidence to recognize valid partial reasoning.

If this is right

Liberal partial-credit prompting reduces average question-level error for every evaluated model family.
ChatGPT 5.5 Thinking (LIBERAL) achieves the lowest question-level MAE of 1.87 and RMSE of 2.53.
Gemini 3.1 Pro Extended (LIBERAL) achieves the lowest total-score MAE of 8.00 and RMSE of 10.66.
Gemini 3.1 Pro Extended (BASELINE) produces the strongest total-score Pearson correlation of 0.58, separating absolute accuracy from rank preservation.
Quantitative agreement metrics are accompanied by practical usability observations for classroom use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Instructors could combine LLM initial scores with targeted human review for the subset of answers where models still diverge most from expected standards.
The observed separation between point accuracy and correlation suggests that grading policies might need to be chosen according to whether the goal is absolute fairness or relative ranking.
Error reductions achieved here could be tested on exams from other mathematics courses to check whether the liberal-prompt benefit generalizes beyond discrete mathematics.
Wider adoption might shift instructor effort from initial scoring toward writing richer feedback on conceptual misconceptions.
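
One way to pick the subset for targeted human review, given that human grades are not available at grading time, is to flag answers on which the models disagree with one another. The sketch below is an assumption of this note rather than a procedure from the paper; the function name `flag_for_review`, the threshold, the answer ids, and the scores are all hypothetical.

```python
def flag_for_review(scores_by_answer, max_spread=2.0):
    """Return answer ids whose LLM scores differ by more than max_spread points.

    scores_by_answer maps an answer id to the list of scores that several
    LLM graders assigned to that answer.
    """
    flagged = []
    for answer_id, scores in scores_by_answer.items():
        spread = max(scores) - min(scores)
        if spread > max_spread:
            flagged.append((answer_id, spread))
    # Largest disagreement first, so reviewers see the hardest cases early.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [answer_id for answer_id, _ in flagged]

# Hypothetical scores from three graders on four answers (not paper data).
llm_scores = {
    "s01-q1": [8, 8, 7],
    "s01-q2": [2, 6, 9],
    "s02-q1": [10, 10, 10],
    "s02-q2": [4, 7, 5],
}
print(flag_for_review(llm_scores))
```

Model disagreement is only a proxy: graders that agree with each other can still all be wrong, so a random audit sample would be a sensible complement.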

Load-bearing premise

Human-assigned grades are treated as the authoritative ground truth without reported statistics on consistency among multiple human graders.

What would settle it

Re-grade the same exams independently with several human graders, then compare the typical disagreement among the humans against the typical disagreement between each LLM and a single human grader.
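
The comparison described above can be sketched in a few lines of Python. Everything here is illustrative: the grader names, the scores, and the helper `mean_pairwise_mae` are hypothetical, and MAE is used as the disagreement measure only because it is one of the metrics the paper already reports.

```python
from itertools import combinations

def mae(a, b):
    """Mean absolute difference between two equal-length score lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mean_pairwise_mae(graders):
    """Average MAE over every pair of graders in the dict."""
    pairs = list(combinations(graders.values(), 2))
    return sum(mae(a, b) for a, b in pairs) / len(pairs)

# Hypothetical per-question scores from three human graders and one LLM.
humans = {
    "grader_a": [8, 5, 10, 3, 7],
    "grader_b": [7, 6, 10, 4, 7],
    "grader_c": [8, 4, 9, 3, 6],
}
llm = [6, 5, 9, 1, 7]

human_vs_human = mean_pairwise_mae(humans)
llm_vs_reference = mae(llm, humans["grader_a"])  # single reference grader
print("human vs human MAE:", round(human_vs_human, 2))
print("LLM vs human MAE:  ", round(llm_vs_reference, 2))
```

If the LLM-to-human figure falls within the range of human-to-human disagreement, the model is roughly as consistent with the reference grader as another human would be; if it sits well above that range, the gap is not explained by ordinary grader variability.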

Figures

Figures reproduced from arXiv: 2607.01247 by Aastha Sapkota, M. G. Sarwar Murshed.

**Figure 2.** Average question-level MAE by model and grading policy. Lower …

**Figure 3.** Total-score MAE by model and grading policy. The total-score row is …

**Figure 5.** Average question-level exact agreement by model and grading policy.

**Figure 6.** Relative reduction in average question-level MAE from the baseline …

Original abstract

Open-ended mathematics exams are valuable because they assess reasoning, proof construction, algorithmic thinking, and communication of intermediate steps. They are also difficult to grade at scale because instructors must apply partial-credit rubrics consistently while giving feedback that helps students repair misconceptions. This paper evaluates six contemporary large language model (LLM) configurations, Gemini 3.1 Pro Extended, Gemini 3.5 Flash, ChatGPT 5.5 Pro Extended, ChatGPT 5.5 Thinking, Claude Pro Opus 4.7, and Claude Sonnet 4.6, as grading assistants for an undergraduate discrete mathematics examination. The study compares two grading policies. The BASELINE policy uses a stricter rubric-following prompt that emphasizes explicit evidence and complete justification. The LIBERAL policy was added after preliminary grading showed that the baseline condition sometimes applied harsh point deductions and failed to recognize valid partial reasoning. Agreement with human grading is measured at both the question and exam-total levels using mean absolute error, root mean squared error, normalized root mean squared error, Pearson correlation, and exact agreement. The results show that liberal partial-credit prompting reduces average question-level error for every evaluated model family. ChatGPT 5.5 Thinking (LIBERAL) has the lowest average question-level MAE (1.87) and RMSE (2.53), while Gemini 3.1 Pro Extended (LIBERAL) has the lowest total-score MAE (8.00) and RMSE (10.66). However, the strongest total-score Pearson correlation occurs under Gemini 3.1 Pro Extended (BASELINE) at 0.58, showing that point calibration and rank preservation remain distinct goals. We also report practical usability observations.
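
The five agreement metrics named in the abstract can be computed with a short Python function. This is a minimal sketch, not the authors' evaluation code: the function name `agreement_metrics` and the example scores are hypothetical, and because the abstract does not say how NRMSE is normalized, the sketch assumes division by the maximum attainable score.

```python
import math

def agreement_metrics(llm, human, max_points):
    """Compare LLM scores with human scores for one question or for totals."""
    n = len(human)
    errors = [p - h for p, h in zip(llm, human)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Assumption: NRMSE divides RMSE by the maximum attainable score.
    nrmse = rmse / max_points
    mean_p, mean_h = sum(llm) / n, sum(human) / n
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(llm, human))
    var_p = sum((p - mean_p) ** 2 for p in llm)
    var_h = sum((h - mean_h) ** 2 for h in human)
    # Pearson is undefined when either score list is constant.
    pearson = cov / math.sqrt(var_p * var_h) if var_p and var_h else float("nan")
    # Exact agreement: share of answers where the two scores are identical.
    exact = sum(1 for e in errors if e == 0) / n
    return {"mae": mae, "rmse": rmse, "nrmse": nrmse,
            "pearson": pearson, "exact": exact}

# Hypothetical scores on a 10-point question (not data from the paper).
human = [10, 7, 4, 9, 6]
llm = [10, 5, 4, 8, 9]
print(agreement_metrics(llm, human, max_points=10))
```

Calling the function once per question and once on exam totals would give the two levels of analysis the abstract describes.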

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Liberal partial-credit prompts cut error against human grades for all six models on this discrete math exam, but the human scores lack any consistency checks.

The main things to know are that a more forgiving prompt for partial credit lowers question-level MAE and RMSE for every model family they tried, and that the human grades used as the benchmark have no reported inter-rater checks.

They ran the same set of real student answers from an undergraduate discrete mathematics exam through six current LLMs under two explicit policies. The baseline prompt sticks closely to the rubric with strict evidence requirements; the liberal version was added after early runs showed overly harsh deductions. They track agreement with five metrics at both per-question and total-score levels and note that the condition with lowest absolute error does not always give the highest correlation.

The comparison itself is direct and uses actual exam responses rather than made-up problems. Reporting the specific numbers and the fact that the pattern holds across model families gives a reader something usable to test in their own setting. The usability notes at the end are a small practical addition.

The soft spot is the ground truth. Human grades are treated as fixed without any mention of multiple graders, reconciliation steps, or even basic agreement statistics. In open-ended partial-credit grading of proofs and reasoning steps, grader differences are common, so the reported error drops could partly reflect alignment to one person's habits rather than a stable standard. The abstract also gives no sample size, question count, or statistical tests, which makes it harder to judge how reliable the averages are.

This is for instructors or researchers who want concrete numbers on current LLMs for math grading assistance. A reader testing similar tools would get a useful baseline from the model-by-model results.

It deserves peer review. The experiment is straightforward, the practical question is clear, and the policy contrast is worth following up even if the human-grading details and sample information need to be added.

Referee Report

2 major / 0 minor

Summary. The manuscript evaluates six LLM configurations (Gemini 3.1 Pro Extended, Gemini 3.5 Flash, ChatGPT 5.5 Pro Extended, ChatGPT 5.5 Thinking, Claude Pro Opus 4.7, Claude Sonnet 4.6) as grading assistants for an undergraduate discrete mathematics exam. It compares BASELINE (strict rubric) and LIBERAL (partial-credit) prompting policies, reporting agreement with human grades via MAE, RMSE, NRMSE, Pearson correlation, and exact agreement at question and total-score levels. The central result is that LIBERAL prompting reduces average question-level error for every model family, with ChatGPT 5.5 Thinking (LIBERAL) achieving the lowest question-level MAE (1.87) and RMSE (2.53), and Gemini 3.1 Pro Extended (LIBERAL) the lowest total-score MAE (8.00) and RMSE (10.66); the highest total-score Pearson correlation is 0.58 under Gemini 3.1 Pro Extended (BASELINE). Practical usability observations are also reported.

Significance. If the results hold, the work offers actionable evidence that liberal partial-credit prompting improves LLM grading consistency across model families on open-ended math exams, with a useful distinction between calibration (MAE/RMSE) and rank preservation (Pearson). This could inform prompt design for educational AI tools. The empirical comparison using standard metrics is a strength, but the significance is limited by the absence of details needed to interpret the human ground truth.

major comments (2)

[Abstract] Abstract (and results paragraph on agreement metrics): The central claim that LIBERAL prompting reduces question-level MAE/RMSE for every model family, with specific winners at MAE 1.87 etc., rests on treating human-assigned grades as authoritative ground truth. No information is provided on the number of independent human graders, inter-rater reliability (ICC, kappa, or pairwise agreement), or score reconciliation procedure. In discrete-math grading with partial credit for reasoning, grader variability is expected; without quantifying it, the reported error reductions cannot be distinguished from alignment to idiosyncratic human judgments.
[Abstract] Abstract: No sample size, number of exam questions, or statistical tests (e.g., paired t-tests or confidence intervals on the MAE/RMSE differences) are reported for the claim that LIBERAL reduces error for every model family. This leaves the numerical improvements only partially verifiable and undermines assessment of whether the reductions are robust or practically meaningful.
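
The second comment asks for paired t-tests or confidence intervals on the error differences. One way to obtain such an interval is sketched below using a paired percentile bootstrap, a technique chosen here for illustration because it needs no distributional assumption; it is not something the paper reports. The function name `bootstrap_mae_diff`, its parameters, and the scores are hypothetical.

```python
import random

def bootstrap_mae_diff(baseline, liberal, human, n_boot=2000, seed=0):
    """95% percentile bootstrap CI for MAE(baseline) - MAE(liberal).

    Resamples answers with replacement, keeping each answer's three scores
    together so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(human)
    # Per-answer difference in absolute error; positive favors LIBERAL.
    gains = [abs(b - h) - abs(l - h)
             for b, l, h in zip(baseline, liberal, human)]
    means = []
    for _ in range(n_boot):
        sample = [gains[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_boot)]
    upper = means[int(0.975 * n_boot) - 1]
    return sum(gains) / n, lower, upper

# Hypothetical scores for eight answers (not data from the paper).
human    = [10, 7, 4, 9, 6, 8, 3, 5]
baseline = [7, 4, 2, 8, 3, 6, 0, 4]
liberal  = [9, 6, 4, 9, 5, 8, 2, 6]
point, lower, upper = bootstrap_mae_diff(baseline, liberal, human)
print(f"MAE reduction: {point:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes zero would suggest the LIBERAL reduction is not an artifact of which answers happened to be sampled, though with a single exam and a single human grader it would say nothing about generalization.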

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed feedback emphasizing the need for transparency on human grading procedures and statistical support for the reported improvements. We respond to each major comment below.

Referee: [Abstract] Abstract (and results paragraph on agreement metrics): The central claim that LIBERAL prompting reduces question-level MAE/RMSE for every model family, with specific winners at MAE 1.87 etc., rests on treating human-assigned grades as authoritative ground truth. No information is provided on the number of independent human graders, inter-rater reliability (ICC, kappa, or pairwise agreement), or score reconciliation procedure. In discrete-math grading with partial credit for reasoning, grader variability is expected; without quantifying it, the reported error reductions cannot be distinguished from alignment to idiosyncratic human judgments.

Authors: We agree that the absence of details on the human grading process limits interpretation of the results. The grading was performed by a single course instructor using a pre-defined rubric; no multiple independent graders or reconciliation procedure were employed. As a result, inter-rater reliability metrics cannot be computed. We will revise the manuscript to describe the grading procedure explicitly and to discuss this as a limitation, noting that some observed agreement may reflect alignment with the specific instructor's judgments.

revision: partial

Referee: [Abstract] Abstract: No sample size, number of exam questions, or statistical tests (e.g., paired t-tests or confidence intervals on the MAE/RMSE differences) are reported for the claim that LIBERAL reduces error for every model family. This leaves the numerical improvements only partially verifiable and undermines assessment of whether the reductions are robust or practically meaningful.

Authors: We will revise the abstract and results sections to report the sample size and number of exam questions, and to include statistical tests (such as paired t-tests) along with confidence intervals on the MAE/RMSE differences between BASELINE and LIBERAL conditions. This will allow readers to assess the robustness of the observed error reductions.

revision: yes

standing simulated objections not resolved

Inter-rater reliability cannot be reported because the study used grading by a single human instructor.

Circularity Check

0 steps flagged

No circularity; direct empirical comparison to external human grades

The paper reports an empirical evaluation of LLM grading performance against human-assigned scores on a discrete mathematics exam, using standard error metrics (MAE, RMSE, Pearson correlation) at question and total-score levels. No derivation chain, equations, fitted parameters, or self-referential constructions appear in the abstract or described methods. Claims about LIBERAL vs BASELINE prompting rest on direct numerical comparisons to an external benchmark (human grades), not on quantities defined in terms of themselves or on self-citations. The work is self-contained against that external reference and exhibits none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on treating human grades as ground truth and on the single exam being representative; no numerical parameters are fitted to data.

axioms (2)

domain assumption: Human grading constitutes a reliable and consistent ground truth for measuring LLM performance.
Basis: All reported agreement metrics are computed against human grades (abstract, agreement paragraph).
domain assumption: The chosen undergraduate discrete mathematics exam is representative of open-ended math assessments in general.
Basis: Results are presented without qualification about exam specificity.

pith-pipeline@v0.9.1-grok · 5848 in / 1474 out tokens · 25705 ms · 2026-07-04T00:40:49.306775+00:00 · methodology
