EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Huiru Xie; Liangliang Chen; Weiyu Sun; Ying Zhang; Yi Zeng; Yongnuo Cai

arxiv: 2602.00095 · v3 · submitted 2026-01-23 · 💻 cs.CV · cs.AI · cs.CY

Pith reviewed 2026-05-16 11:26 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CY

keywords multimodal LLMs · handwritten recognition · STEM education · auto-grading · educational AI · benchmark dataset · university courses

The pith

MLLMs exhibit widespread latent failures when recognizing and grading authentic university handwritten STEM solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper releases the EDU-CIRCUIT-HW dataset of more than 1,300 authentic university student handwritten solutions from a STEM course. It uses expert-verified transcriptions and grading reports to test MLLMs on both accurate recognition of intertwined formulas, diagrams and text, and on auto-grading performance. The results show large numbers of unrecognized errors in the models' outputs. A reader should care because while MLLMs promise to ease teacher burdens through automated assessment, these failures mean they are not yet dependable for important educational decisions. The paper also shows one way to mitigate this by catching common error patterns and sending only a few percent of cases to human review.

Core claim

The EDU-CIRCUIT-HW dataset enables simultaneous assessment of MLLM recognition fidelity and auto-grading performance on unconstrained handwritten STEM solutions using expert-verified ground truth. This evaluation uncovers an astonishing scale of latent failures within the recognized content, demonstrating that MLLMs lack sufficient reliability for applications like auto-grading in high-stakes educational settings.

What carries the argument

The EDU-CIRCUIT-HW dataset of 1,300+ authentic handwritten solutions paired with expert-verified verbatim transcriptions and grading reports, which tests upstream recognition accuracy and downstream task performance in parallel.
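
As a rough illustration of what scoring recognition fidelity against a reference transcription can look like, the sketch below computes a character error rate (CER) from Levenshtein edit distance. CER is an assumption chosen for illustration only: this page does not state which fidelity metric the paper uses, and the function names `edit_distance` and `char_error_rate` are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of `reference`
    # and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the reference length (illustrative metric)."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A character-level score like this treats a dropped minus sign and a dropped filler word alike, which is one reason a benchmark might prefer categorized, item-level error counts instead.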

If this is right

Current MLLMs are insufficiently reliable for auto-grading in high-stakes educational settings.
Identified error patterns can be leveraged to detect and correct recognition errors preemptively.
A hybrid system routing only 3.3% of assignments to human graders while using GPT-5.1 for the rest enhances overall robustness.
Authentic handwritten benchmarks are required to properly evaluate MLLMs for educational understanding tasks.
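
A minimal sketch of the hybrid routing idea, assuming each submission carries a count of unresolved suspected recognition errors and a coarse grader confidence label. The `Submission` fields, the `route` function, and the `max_flagged` threshold are hypothetical and not the paper's implementation; the 3.3% figure comes from the paper's case study and is not reproduced by this toy rule.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    submission_id: str
    flagged_errors: int     # hypothetical: suspected recognition errors left unresolved
    grader_confidence: str  # hypothetical: "high" or "low" from the automated grader


def route(sub: Submission, max_flagged: int = 0) -> str:
    """Return 'human' when the automated result looks unreliable, else 'auto'."""
    if sub.grader_confidence == "low" or sub.flagged_errors > max_flagged:
        return "human"
    return "auto"


def human_fraction(subs: list, max_flagged: int = 0) -> float:
    """Share of submissions that the rule above sends to human graders."""
    if not subs:
        return 0.0
    routed = [route(s, max_flagged) for s in subs]
    return routed.count("human") / len(subs)
```

Raising `max_flagged` trades human workload for risk, so in practice the threshold would be tuned against held-out expert grades.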

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models may benefit from additional training on handwritten mathematical expressions and diagrams common in STEM.
The error-pattern detection approach could be adapted to evaluate MLLMs in other domains involving complex visual reasoning.
Hybrid AI-human systems might scale to other high-stakes assessment settings beyond this single course.

Load-bearing premise

The expert-verified verbatim transcriptions and grading reports serve as accurate ground truth that fully captures the students' intended logic and content in unconstrained handwritten solutions.

What would settle it

An independent expert re-transcription of the handwritten solutions showing that MLLM outputs match student intent at rates comparable to the provided ground truth would falsify the scale of latent failures.

Figures

Figures reproduced from arXiv: 2602.00095 by Huiru Xie, Liangliang Chen, Weiyu Sun, Ying Zhang, Yi Zeng, Yongnuo Cai.

**Figure 3.** The proposed auto-grading pipeline. The “vanilla grading pipeline” in green box is a widely-used auto

**Figure 4.** Comparisons of different MLLMs’ recognition error counts over 4 error categories (defined in Table

**Figure 5.** Prompt used to categorize recognition errors according to the taxonomy defined in Table

**Figure 6.** Screenshot of a handwritten solution from Stu

**Figure 7.** Screenshot of a student handwritten solution

**Figure 8.** Screenshot of a student handwritten solution

**Figure 9.** Grading report for a student submission on

**Figure 10.** An example in our dataset including a student’s solution on using foundations from physics, differential

**Figure 11.** An example in our dataset including a student’s solution on using linear algebra and complex operation

**Figure 12.** Transcription generated by Gemini-2.5-Pro from the student handwritten solution shown in

**Figure 13.** Expert-rectified version of the Gemini-2.5-Pro transcription from the student handwritten solution shown in

**Figure 14.** Transcription generated by GPT-5.1 from the student handwritten solution shown in

**Figure 15.** Expert-rectified transcription from Gemini-2.5-Pro for the student handwritten solution shown in

**Figure 16.** Prompt used to instruct Gemini-3-Pro-Preview, Gemini-2.5-Pro, GPT-5.1, and Claude-4.5-Sonnet to transcribe student handwritten solutions into Markdown format

**Figure 17.** Prompt used to instruct Qwen3-VL-Plus and Qwen3-VL-8B-Thinking to transcribe student handwritten solutions into Markdown format

**Figure 18.** Prompt used by the LLM-enabled recognition error detector to identify potential recognition errors

**Figure 19.** Recognition errors and their corresponding rectifications for the sample shown in Figure

**Figure 20.** Explanations for each item-level recognition error presented in Figure

**Figure 21.** Grading rubric for question P9.7-2, outlining how to identify student mistakes and assign score deductions

**Figure 22.** First-round prompt used by the LLM grader to generate point deductions based on detected mistakes in

**Figure 23.** Prompt for the second round in which the LLM grader explains the reasons for score deductions and

**Figure 24.** Prompt used to instruct the LLM regrader (

**Figure 25.** Prompt (Part I) for heuristically identifying potential recognition errors in transcriptions of student

**Figure 26.** Prompt (Part II) for heuristically identifying potential recognition errors in transcriptions of student

Original abstract

Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a potential solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and correct recognition errors, while requiring only minimal human intervention (e.g., routing 3.3% of assignments to human graders and the remainder to the GPT-5.1 grader), can effectively enhance the robustness of the deployed AI-enabled grading system. Code and dataset are available in this GitHub repo: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives us a new real-world dataset of 1,300+ handwritten university STEM solutions for testing MLLM recognition and grading, but its claims about model failures depend on single-expert transcriptions that may not handle handwriting ambiguities cleanly.

The main thing to know is that this work releases EDU-CIRCUIT-HW, a collection of authentic student handwritten solutions from a university STEM course, and uses it to measure how well MLLMs handle both upstream recognition of the messy content and downstream auto-grading. They also sketch a hybrid fix that routes only a small fraction of cases to humans based on spotted error patterns. That combination of dataset release and practical suggestion is the useful part here.

The data and code are public, which helps anyone who wants to build on it or check the numbers themselves. The dual evaluation setup moves past the common habit of testing only on clean downstream tasks, so it gives a fuller picture of where the models break on real student work with formulas, diagrams, and partial reasoning. The hybrid routing idea is straightforward and shows they are thinking about deployment rather than just pointing out problems.

The soft spot is the ground truth. Everything rests on expert-verified verbatim transcriptions and grades from what appears to be a single annotator. In unconstrained handwritten STEM material, equations and sketches often allow more than one reasonable parsing, so some of the reported recognition errors could be differences in how the content is read rather than outright model misunderstanding. Without inter-annotator agreement numbers or a breakdown of error types, the scale of the failures is hard to pin down precisely. The abstract calls it an astonishing scale, but the lack of those details makes the strength of that conclusion depend on how carefully the transcriptions were validated.

This is for researchers working on multimodal models for education or document understanding who need benchmarks that reflect actual classroom handwriting. Readers who want concrete data on where current MLLMs fall short in high-stakes settings will find the dataset worth looking at. It deserves a serious referee because the dataset is new, the evaluation protocol is straightforward, and the practical angle is clear, even if the annotation reliability needs more attention in review.

Referee Report

2 major / 1 minor

Summary. The paper introduces the EDU-CIRCUIT-HW dataset comprising 1,300+ authentic university-level STEM student handwritten solutions. It evaluates multiple MLLMs on upstream recognition fidelity by comparing outputs against expert-verified verbatim transcriptions and on downstream auto-grading performance. The central claim is that MLLMs exhibit an astonishing scale of latent failures in interpreting unconstrained handwritten content with intertwined formulas, diagrams, and reasoning, rendering them insufficiently reliable for high-stakes educational uses such as auto-grading. A case study demonstrates that routing only 3.3% of assignments to human graders based on detected error patterns can improve overall system robustness, with the dataset and code released publicly.

Significance. If the empirical results hold after addressing ground-truth validation, the work is significant as it supplies a realistic, domain-specific benchmark for MLLM handling of real student handwriting in STEM, an area where existing evaluations are limited. The dual upstream/downstream evaluation framework and the practical hybrid grading case study provide actionable insights for educational AI deployment. Public release of the dataset and code supports reproducibility and future work on error mitigation.

major comments (2)

[Dataset section] The evaluation treats single-expert verbatim transcriptions and grading reports as unambiguous ground truth. In unconstrained university STEM handwriting, intertwined equations, sketches, and partial reasoning steps frequently admit multiple valid parsings; without reported inter-annotator agreement or multi-expert validation, apparent MLLM recognition failures may partly reflect transcription variance rather than model misunderstanding. This assumption is load-bearing for both the fidelity metrics and the downstream auto-grading claims.
[Evaluation and Results sections] The headline claim of an 'astonishing scale of latent failures' and 'insufficient reliability' requires concrete quantitative support, including per-model error rates, error distributions across the 1,300+ solutions, number of samples tested per model, and any statistical tests. The abstract-only view provides no such details, limiting verification of the scale and undermining the strength of the reliability conclusion.

minor comments (1)

[Abstract] The reference to 'GPT-5.1' should be clarified (e.g., whether it is a hypothetical or specific model version) to avoid confusion with current models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the EDU-CIRCUIT-HW dataset and evaluation. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Referee: [Dataset section] The evaluation treats single-expert verbatim transcriptions and grading reports as unambiguous ground truth. In unconstrained university STEM handwriting, intertwined equations, sketches, and partial reasoning steps frequently admit multiple valid parsings; without reported inter-annotator agreement or multi-expert validation, apparent MLLM recognition failures may partly reflect transcription variance rather than model misunderstanding. This assumption is load-bearing for both the fidelity metrics and the downstream auto-grading claims.

Authors: We agree that inter-annotator agreement (IAA) metrics would strengthen the ground-truth validation. The original transcriptions and grades were produced by a single expert instructor with more than a decade of experience teaching the specific course, who verified each item against the handwritten originals and official rubrics. To address the concern, we will add a new subsection reporting IAA on a randomly sampled subset of 200 solutions transcribed and graded independently by a second expert. We will report word-level agreement (e.g., BLEU and exact-match) for transcriptions and percentage agreement plus Cohen's kappa for grades. These results will be included in the revised Dataset section.

revision: yes
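
For readers unfamiliar with the grade-agreement statistic named in this response, here is a minimal sketch of Cohen's kappa for two annotators, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function name and the example labels are hypothetical; nothing here reflects the authors' actual analysis code or results.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used one identical label
    return (observed - expected) / (1 - expected)
```

Kappa assumes categorical labels, so continuous scores would need to be binned into grade bands first, or a weighted variant used instead.
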
Referee: [Evaluation and Results sections] The headline claim of an 'astonishing scale of latent failures' and 'insufficient reliability' requires concrete quantitative support, including per-model error rates, error distributions across the 1,300+ solutions, number of samples tested per model, and any statistical tests. The abstract-only view provides no such details, limiting verification of the scale and undermining the strength of the reliability conclusion.

Authors: The full manuscript already reports per-model character and semantic error rates, error-type distributions across all 1,300+ solutions, and the fact that every model was evaluated on the complete set. We will revise the Results section to include an explicit summary table of these metrics, add statistical significance tests (paired Wilcoxon signed-rank tests) comparing model performances, and update the abstract to feature the key quantitative figures (e.g., average recognition error rates and the 3.3% routing threshold).

revision: partial
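
To make the proposed paired comparison concrete, the sketch below implements a basic Wilcoxon signed-rank test in plain Python for two models' per-solution error counts. It is an illustration under stated assumptions: zero differences are dropped, tied magnitudes get average ranks, and the two-sided p-value uses a normal approximation with no tie or continuity correction. The function name and inputs are hypothetical, and a maintained statistics library would be the better choice for real analyses.

```python
import math


def wilcoxon_signed_rank(x: list, y: list) -> tuple:
    """Return (W_plus, W_minus, approximate two-sided p-value) for paired samples."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    # Paired differences; pairs with zero difference carry no sign and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, giving tied magnitudes their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Normal approximation to the null distribution of the smaller rank sum.
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value
```

The normal approximation is only reasonable for moderately large samples; with a handful of pairs an exact test is preferable.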

Circularity Check

0 steps flagged

No circularity: pure empirical evaluation on external dataset

The paper releases a new dataset of 1,300+ real student handwritten STEM solutions and performs direct empirical evaluation of MLLM recognition and auto-grading against expert-verified verbatim transcriptions plus grading reports. No derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. Claims rest on straightforward comparison to independent ground truth rather than any reduction of outputs to inputs by construction. Self-citations are absent from the load-bearing evaluation steps, and the case study on error correction uses identified patterns without re-fitting or re-deriving the core metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the dataset being representative of real-world challenges and on expert transcriptions being reliable ground truth; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Expert-verified verbatim transcriptions accurately represent the handwritten content and student reasoning.
Used as ground truth for measuring MLLM recognition fidelity and grading performance.

pith-pipeline@v0.9.0 · 5614 in / 1183 out tokens · 43664 ms · 2026-05-16T11:26:50.542004+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1]

InProceedings of the 2023 Conference on Empiri- cal Methods in Natural Language Processing, pages 4057–4068

Language model is suitable for correction of handwritten mathematical expressions recognition. InProceedings of the 2023 Conference on Empiri- cal Methods in Natural Language Processing, pages 4057–4068. Philippe Gervais, Anastasiia Fadeeva, and Andrii Mak- sai. 2025. Mathwriting: A dataset for handwritten mathematical expression recognition. InProceed- i...

work page arXiv 2023
[2]

https://doi.org/10.48550/arXiv.2408.11728, http://arxiv.org/abs/2408.11728, arXiv:2408.11728 [math]

Grading assistance for a handwritten thermo- dynamics exam using artificial intelligence: An ex- ploratory study.Physical Review Physics Education Research, 20(2):020144. Tianyi Liu, Julia Chatain, Laura Kobel-Keller, Gerd Kortemeyer, Thomas Willwacher, and Mrinmaya Sachan. 2024. Ai-assisted automated short answer grading of handwritten university level m...

work page arXiv 2024
[3]

John Pavlopoulos, Vasiliki Kougia, Paraskevi Platanou, Holger Essler, and 1 others

Automated grading of students’ handwrit- ten graphs: A comparison of meta-learning and vision-large language models.arXiv preprint arXiv:2507.03056. John Pavlopoulos, Vasiliki Kougia, Paraskevi Platanou, Holger Essler, and 1 others. 2023. Detecting erro- neous handwritten byzantine text recognition. In Findings of the Conference on Empirical Methods in Na...

work page arXiv 2023
[4]

Mathbert: A pre-trained model for mathematical formula understanding.arXiv preprint arXiv:2105.00377,

Mathbert: A pre-trained model for math- ematical formula understanding.arXiv preprint arXiv:2105.00377. Rohith Reddy Rachala and Mahesh Raveendranatha Panicker. 2022. Hand-drawn electrical circuit recog- nition using object detection and node recognition. SN Computer Science, 3(3):244. Lejla Skelic, Yan Xu, Matthew Cox, Wenjie Lu, Tao Yu, and Ruonan Han. ...

work page arXiv 2022
[5]

detect-then-rectify

Icdar 2023 crohme: Competition on recog- nition of handwritten mathematical expressions. In Document Analysis and Recognition - ICDAR 2023: 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part II, page 553–565, Berlin, Heidelberg. Springer-Verlag. Haohang Xu, Chengjie Liu, Qihang Wang, Wenhao Huang, Yongjian Xu, Weiyu Ch...

work page arXiv 2023
[6]

gold standard

Due to copyright restrictions, EDU-CIRCUIT-HW does not include the original problem statements, figures, or image-based hints. Researchers wishing to replicate experiments involving the original con- tent are encouraged to consult the textbook using the provided question IDs. For all questions listed in Table 7, we developed corresponding reference answer...

work page 2013
[7]

**Identify the Core Content**: Focus on the student’s actual work (the text, equations, diagrams, and code ) in their image

work page
[8]

Students often redraw circuits to aid their analysis

**Analyze and Describe Circuit Diagrams**: This is a critical step. Students often redraw circuits to aid their analysis. You must analyze any circuit diagrams drawn by the student and describe them **only if they add new information** compared to the original problem’s diagram. * **A. For Annotations:** If the student redraws the circuit and adds labels ...

work page
[9]

For code, preserve indentation and syntax

**Accurate Text Transcription**: Transcribe all other handwritten text as accurately as possible. For code, preserve indentation and syntax

work page
[10]

* Use inline math with single dollar signs: ‘$formula$‘ for simple expressions

**Mathematical Formula Formatting**: * Convert all mathematical expressions, equations, and formulas to proper LaTeX syntax. * Use inline math with single dollar signs: ‘$formula$‘ for simple expressions. * Use display math with double dollar signs: ‘$$formula$$‘ for equations that should be centered on their own line. * Ensure all mathematical symbols, s...

work page
[11]

**Formatting**: Use appropriate Markdown formatting (headings, lists, bold text) for clarity while preserving the student’s original organization

work page
[12]

Do not include any introductory phrases or explanations about the transcription process

**Output**: Return only the final Markdown content of the student’s transcribed work. Do not include any introductory phrases or explanations about the transcription process. Figure 16: Prompt used to instruct Gemini-3-Pro-Preview, Gemini-2.5-Pro, GPT-5.1, and Claude-4.5-Sonnet to transcribe student handwritten solutions into Markdown format. Prompt for s...

work page
[13]

Focus only on the handwritten content

**Identify the Student’s Ink**: Ignore printed text from the worksheet/exam paper unless the student has written over it. Focus only on the handwritten content

work page
[14]

Do not describe the printed problem image

**Analyze Student-Drawn Diagrams (STRICTLY CONDITIONAL)**: * **Check First**: Did the student actually **redraw** or **annotate** a circuit diagram by hand? * **If NO**: Skip this step entirely. Do not describe the printed problem image. * **If YES (and only if new information is added)**: * **For Annotations**: If the student added labels or arrows to a ...

work page
[15]

**Strict Text Transcription**: Transcribe handwritten text exactly as it appears

work page
[16]

* Use inline math ‘$...$‘ and display math ‘$$...$$‘ appropriately

**Mathematical Formula Formatting**: * Convert handwritten math to proper LaTeX. * Use inline math ‘$...$‘ and display math ‘$$...$$‘ appropriately. * **Verify against the image**: Ensure every symbol in your LaTeX output exists in the student’s handwriting

work page
[17]

Start directly with the content

**Output Format**: Return **only** the Markdown content of the transcription. Start directly with the content. Figure 17: Prompt used to instruct Qwen3-VL-Plus and Qwen3-VL-8B-Thinking to transcribe student handwritten solutions into Markdown format. One-shot prompt for the LLM detector to find recognition errors in the MLLM recognized text You are an exp...

work page
[18]

**IGNORE Syntax Differences:** Do not report benign LaTeX variations (e.g., ‘\\frac‘ vs ‘\\dfrac‘, extra whitespaces) if the rendered mathematical meaning is identical

work page
[19]

So," "Therefore

**IGNORE Minor Wording:** Do not report missing conversational filler words (e.g., "So," "Therefore") unless they change the logic. Ignore case differences (e.g., $v$ vs $V$)

work page
[20]

No significant errors found

**ABOUT Redrawn Diagrams:** If the OCR result depicts a student’s redrawn diagram, ignore layout differences unless the values, units, or directions (e.g., 2\Omega vs 3\Omega, clockwise vs counterclockwise) are clearly incorrect compared to the Ground Truth logic. ### OUTPUT FORMAT: If errors are found, list them in the following structured format: 1. Sou...

work page
[21]

The differences in the crossed-out or cancelled content, unless only one of the ground truth/target’s content is crossed-out or cancelled

work page
[22]

\int tdt

Capitalization differences (e.g., $v$ vs $V$, $v_0$ vs $V_o$, $w$ and $W$) **unless** they break the formula’s mathematical meaning (e.g., both $v$ vs $V$ expressing the same physical quantity in a single formula like $dv/dt + V = 0$, "\int tdt" vs "\intt\tau", etc.)

work page
[23]

5e^7" vs

Formating or presenting differences which won’t change the formula’s mathematical meaning/result (e.g., $j2$ vs $2j$, $3A$ vs $3(B-2)$ where it’s written that A=B-2 before, "5e^7" vs "5 e^7.0", "100mH vs 0.1H", "A = 3" vs "A -> 3", "cos 2t" vs "cos(2t)", -$\frac{1}{2/3}$ vs $\frac{-3}{2}$, etc.) should be ignored. They make the formula/sentence looks diff...

work page
[24]

For EVERY applied deduction (any code M/E/C/U/NC with subtotal > 0 or non-empty ’applied’), explain *why* with short, specific reasons tied to the student’s content

work page
[25]

Show a concrete EVIDENCE SNIPPET: an exact, short substring copied from the student’s Markdown that motivated the deduction (when possible)

work page
[26]

high" or

Provide a single OVERALL CONFIDENCE label: "high" or "low". Use "low" when OCR noise/ambiguity/insufficient evidence could materially affect the scoring; otherwise "high"

work page
[27]

background-color:#ffd6d6;color:#b00020;font-weight:bold

Produce an EDITED MARKDOWN where ONLY the problematic parts are visually flagged **in-place**, keeping all other text untouched: - Wrap error spans with: <span style="background-color:#ffd6d6;color:#b00020;font-weight:bold"> ... </span> - Immediately after the span, append a small bracket note like: [deduct -0.03, reason: ...] - Do not move or rewrite sur...

work page
[28]

If recognition_errors likely caused the previous deductions (e.g., a wrong symbol/value that is inconsistent with context and appears only once), then IGNORE those recognition-induced issues in the second grading (do NOT deduct for them)

work page
[29]

If a reasonable student mistake was NOT considered and it materially affects correctness *be conservative when making this judgement*, deduct accordingly this time

If there are student_own_mistakes in the recognition-error triage, determine (a) whether the first grading properly accounted for them or (b) whether there are also recognition errors based on the previous AI grader’s outcome, the student’s Markdown solution, and, the problem information. If a reasonable student mistake was NOT considered and it materiall...

work page
[30]

If neither applies materially, keep the original deductions (or return empty if none)

work page
[31]

per_code

Only deduct among M/E/C/U/NC. Follow the rubric_snippet; do not exceed each code’s max deduction (caps will also be applied downstream). Provide brief, concrete justification tied to the student’s Markdown (and final answer), referencing recognition triage details where needed. --- INPUTS --- Problem statement: {problem_statement} Correct final answer (co...

work page 2000
[32]

Moreover, if the ‘‘mistake’’ appears mid-derivation but the final answer matches the correct final answer, it is empirically not a student’s own mistake

[student_own_mistake] When there is no contextual support to disambiguate AND the recognized value/result is completely unreasonable or far from the expected magnitude (e.g., should be 1.2 but shows 3457; or drastic unit/number mismatch). Moreover, if the ‘‘mistake’’ appears mid-derivation but the final answer matches the correct final answer, it is empir...

work page
[33]

[recognition_error] If context above and below is consistent and only a single line is wrong, label recognition_error. Also label recognition_error when a variable/value conflicts with provided diagrams or a variable never appears in the statement/diagram/description but suddenly shows up (spurious). Include possible OCR sign loss: if removing or adding a...

work page
[34]

In these ambiguous cases you cannot rule out recognition vs student error, so label the issue as uncertain

[uncertain] In the same ‘‘no-context’’ setting as (1), but the recognized number is close to or easily confusable with the expected value (e.g., 4 vs 9, faint minus, 1/x vs x). In these ambiguous cases you cannot rule out recognition vs student error, so label the issue as uncertain. Use sparingly

work page
[35]

uncertain

General Consistency Checks - Logical/Numerical inconsistency: symbol/number changes meaning without justification (e.g., Va=5V becomes -5V/5A). - Malformed formulas: e.g., parallel resistors written as R1 + R2; KCL/KVL sign violations. Figure 25: Prompt (Part I) for heuristically identifying potential recognition errors in transcriptions of student handwr...

work page
[36]

Problem Statement: {problem_statement}

work page
[37]

Correct Final Answer (for reference): {final_answer}

work page
[38]

recognition_error

Recognized Student Solution (Markdown): {markdown_content} --- TASK --- Read the solution and identify issues. For each issue, provide: - a precise location_snippet from the Markdown, - a short diagnostic, - a tags array containing exactly one of: ["recognition_error"], ["student_own_mistake"], or ["uncertain"], - any image indices used (e.g., "#S0", "#P1...

work page

[1] [1]

InProceedings of the 2023 Conference on Empiri- cal Methods in Natural Language Processing, pages 4057–4068

Language model is suitable for correction of handwritten mathematical expressions recognition. InProceedings of the 2023 Conference on Empiri- cal Methods in Natural Language Processing, pages 4057–4068. Philippe Gervais, Anastasiia Fadeeva, and Andrii Mak- sai. 2025. Mathwriting: A dataset for handwritten mathematical expression recognition. InProceed- i...

[2]

https://doi.org/10.48550/arXiv.2408.11728, http://arxiv.org/abs/2408.11728, arXiv:2408.11728 [math]

Grading assistance for a handwritten thermodynamics exam using artificial intelligence: An exploratory study. Physical Review Physics Education Research, 20(2):020144. Tianyi Liu, Julia Chatain, Laura Kobel-Keller, Gerd Kortemeyer, Thomas Willwacher, and Mrinmaya Sachan. 2024. Ai-assisted automated short answer grading of handwritten university level m...

[3]

Automated grading of students’ handwritten graphs: A comparison of meta-learning and vision-large language models. arXiv preprint arXiv:2507.03056. John Pavlopoulos, Vasiliki Kougia, Paraskevi Platanou, Holger Essler, and 1 others. 2023. Detecting erroneous handwritten byzantine text recognition. In Findings of the Conference on Empirical Methods in Na...

[4]

Mathbert: A pre-trained model for mathematical formula understanding. arXiv preprint arXiv:2105.00377. Rohith Reddy Rachala and Mahesh Raveendranatha Panicker. 2022. Hand-drawn electrical circuit recognition using object detection and node recognition. SN Computer Science, 3(3):244. Lejla Skelic, Yan Xu, Matthew Cox, Wenjie Lu, Tao Yu, and Ruonan Han. ...

[5]

detect-then-rectify

Icdar 2023 crohme: Competition on recognition of handwritten mathematical expressions. In Document Analysis and Recognition - ICDAR 2023: 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part II, page 553–565, Berlin, Heidelberg. Springer-Verlag. Haohang Xu, Chengjie Liu, Qihang Wang, Wenhao Huang, Yongjian Xu, Weiyu Ch...

[6]

gold standard

Due to copyright restrictions, EDU-CIRCUIT-HW does not include the original problem statements, figures, or image-based hints. Researchers wishing to replicate experiments involving the original content are encouraged to consult the textbook using the provided question IDs. For all questions listed in Table 7, we developed corresponding reference answer...

[7]

**Identify the Core Content**: Focus on the student’s actual work (the text, equations, diagrams, and code) in their image

[8]

**Analyze and Describe Circuit Diagrams**: This is a critical step. Students often redraw circuits to aid their analysis. You must analyze any circuit diagrams drawn by the student and describe them **only if they add new information** compared to the original problem’s diagram. * **A. For Annotations:** If the student redraws the circuit and adds labels ...

[9]

**Accurate Text Transcription**: Transcribe all other handwritten text as accurately as possible. For code, preserve indentation and syntax

[10]

**Mathematical Formula Formatting**: * Convert all mathematical expressions, equations, and formulas to proper LaTeX syntax. * Use inline math with single dollar signs: `$formula$` for simple expressions. * Use display math with double dollar signs: `$$formula$$` for equations that should be centered on their own line. * Ensure all mathematical symbols, s...

[11]

**Formatting**: Use appropriate Markdown formatting (headings, lists, bold text) for clarity while preserving the student’s original organization

[12]

**Output**: Return only the final Markdown content of the student’s transcribed work. Do not include any introductory phrases or explanations about the transcription process. Figure 16: Prompt used to instruct Gemini-3-Pro-Preview, Gemini-2.5-Pro, GPT-5.1, and Claude-4.5-Sonnet to transcribe student handwritten solutions into Markdown format. Prompt for s...

[13]

**Identify the Student’s Ink**: Ignore printed text from the worksheet/exam paper unless the student has written over it. Focus only on the handwritten content

[14]

**Analyze Student-Drawn Diagrams (STRICTLY CONDITIONAL)**: * **Check First**: Did the student actually **redraw** or **annotate** a circuit diagram by hand? * **If NO**: Skip this step entirely. Do not describe the printed problem image. * **If YES (and only if new information is added)**: * **For Annotations**: If the student added labels or arrows to a ...

[15]

**Strict Text Transcription**: Transcribe handwritten text exactly as it appears

[16]

**Mathematical Formula Formatting**: * Convert handwritten math to proper LaTeX. * Use inline math `$...$` and display math `$$...$$` appropriately. * **Verify against the image**: Ensure every symbol in your LaTeX output exists in the student’s handwriting

[17]

**Output Format**: Return **only** the Markdown content of the transcription. Start directly with the content. Figure 17: Prompt used to instruct Qwen3-VL-Plus and Qwen3-VL-8B-Thinking to transcribe student handwritten solutions into Markdown format. One-shot prompt for the LLM detector to find recognition errors in the MLLM recognized text You are an exp...

[18]

**IGNORE Syntax Differences:** Do not report benign LaTeX variations (e.g., `\\frac` vs `\\dfrac`, extra whitespaces) if the rendered mathematical meaning is identical

[19]

**IGNORE Minor Wording:** Do not report missing conversational filler words (e.g., "So," "Therefore") unless they change the logic. Ignore case differences (e.g., $v$ vs $V$)
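
The "ignore benign variations" rules for the detector can also be approximated by normalizing both strings before any comparison. The sketch below is an illustration under stated assumptions, not part of the paper's pipeline: `normalize_latex` is a hypothetical helper that handles only the two variations the prompt names explicitly (`\frac` vs `\dfrac` and extra whitespace). Case folding is deliberately left out, because lowercasing would also merge distinct symbols such as `\Omega` and `\omega`.

```python
import re


def normalize_latex(expr: str) -> str:
    """Canonicalize two benign LaTeX variations before comparing strings."""
    # Treat \dfrac as \frac: both render the same fraction.
    expr = expr.replace("\\dfrac", "\\frac")
    # Collapse runs of whitespace to a single space and trim the ends.
    expr = re.sub(r"\s+", " ", expr).strip()
    return expr


def same_after_normalization(ground_truth: str, recognized: str) -> bool:
    """True when the two strings differ only by the variations handled above."""
    return normalize_latex(ground_truth) == normalize_latex(recognized)
```

Strings that still differ after normalization (for example `2\Omega` vs `3\Omega`) remain candidates for a reported recognition error.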

[20]

No significant errors found

**ABOUT Redrawn Diagrams:** If the OCR result depicts a student’s redrawn diagram, ignore layout differences unless the values, units, or directions (e.g., 2\Omega vs 3\Omega, clockwise vs counterclockwise) are clearly incorrect compared to the Ground Truth logic. ### OUTPUT FORMAT: If errors are found, list them in the following structured format: 1. Sou...

[21]

The differences in the crossed-out or cancelled content, unless only one of the ground truth/target’s content is crossed-out or cancelled

[22]

Capitalization differences (e.g., $v$ vs $V$, $v_0$ vs $V_o$, $w$ and $W$) **unless** they break the formula’s mathematical meaning (e.g., both $v$ vs $V$ expressing the same physical quantity in a single formula like $dv/dt + V = 0$, "\int tdt" vs "\intt\tau", etc.)

[23]

Formating or presenting differences which won’t change the formula’s mathematical meaning/result (e.g., $j2$ vs $2j$, $3A$ vs $3(B-2)$ where it’s written that A=B-2 before, "5e^7" vs "5 e^7.0", "100mH vs 0.1H", "A = 3" vs "A -> 3", "cos 2t" vs "cos(2t)", -$\frac{1}{2/3}$ vs $\frac{-3}{2}$, etc.) should be ignored. They make the formula/sentence looks diff...

[24]

For EVERY applied deduction (any code M/E/C/U/NC with subtotal > 0 or non-empty ’applied’), explain *why* with short, specific reasons tied to the student’s content

[25]

Show a concrete EVIDENCE SNIPPET: an exact, short substring copied from the student’s Markdown that motivated the deduction (when possible)

[26]

Provide a single OVERALL CONFIDENCE label: "high" or "low". Use "low" when OCR noise/ambiguity/insufficient evidence could materially affect the scoring; otherwise "high"

[27]

Produce an EDITED MARKDOWN where ONLY the problematic parts are visually flagged **in-place**, keeping all other text untouched: - Wrap error spans with: <span style="background-color:#ffd6d6;color:#b00020;font-weight:bold"> ... </span> - Immediately after the span, append a small bracket note like: [deduct -0.03, reason: ...] - Do not move or rewrite sur...
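
The in-place flagging format described above can be reproduced with plain string operations. This is a minimal sketch assuming the span style and bracket-note wording quoted in the prompt; the function name `flag_span`, its parameters, and the choice to flag only the first occurrence of a snippet are assumptions made for illustration.

```python
# Opening tag copied from the prompt; everything else is a hypothetical helper.
SPAN_OPEN = '<span style="background-color:#ffd6d6;color:#b00020;font-weight:bold">'
SPAN_CLOSE = "</span>"


def flag_span(markdown: str, snippet: str, deduction: float, reason: str) -> str:
    """Wrap the first occurrence of snippet and append a bracket note after it."""
    if not snippet:
        return markdown  # nothing to flag
    start = markdown.find(snippet)
    if start == -1:
        return markdown  # snippet not present; leave the text untouched
    end = start + len(snippet)
    note = f" [deduct -{deduction:.2f}, reason: {reason}]"
    # Text before and after the span is copied through unchanged.
    return markdown[:start] + SPAN_OPEN + snippet + SPAN_CLOSE + note + markdown[end:]
```

Because only the matched substring is wrapped, the surrounding Markdown stays byte-for-byte identical, which matches the prompt's requirement to keep all other text untouched.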

[28]

If recognition_errors likely caused the previous deductions (e.g., a wrong symbol/value that is inconsistent with context and appears only once), then IGNORE those recognition-induced issues in the second grading (do NOT deduct for them)

[29]

If there are student_own_mistakes in the recognition-error triage, determine (a) whether the first grading properly accounted for them or (b) whether there are also recognition errors based on the previous AI grader’s outcome, the student’s Markdown solution, and, the problem information. If a reasonable student mistake was NOT considered and it materiall...

[30]

If neither applies materially, keep the original deductions (or return empty if none)

[31]

per_code

Only deduct among M/E/C/U/NC. Follow the rubric_snippet; do not exceed each code’s max deduction (caps will also be applied downstream). Provide brief, concrete justification tied to the student’s Markdown (and final answer), referencing recognition triage details where needed. --- INPUTS --- Problem statement: {problem_statement} Correct final answer (co...
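
The prompt notes that per-code caps are also applied downstream. One way such a cap step might look is sketched below. The codes M/E/C/U/NC come from the prompt; the function name `cap_deductions`, the dictionary layout, and the rule that a code with no listed cap receives no deduction are assumptions, and no actual cap values from the paper's rubric are implied.

```python
# Deduction codes named in the prompt.
CODES = ("M", "E", "C", "U", "NC")


def cap_deductions(proposed: dict[str, float], caps: dict[str, float]) -> dict[str, float]:
    """Clamp each proposed deduction to the range [0, cap] for known codes only."""
    capped = {}
    for code in CODES:
        # Negative values are treated as "no deduction".
        value = max(0.0, proposed.get(code, 0.0))
        # Assumption: a code without a listed cap gets a cap of zero.
        capped[code] = min(value, caps.get(code, 0.0))
    return capped
```

Deductions proposed under any code outside M/E/C/U/NC are dropped, in line with the instruction to deduct only among those five codes.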

[32]

[student_own_mistake] When there is no contextual support to disambiguate AND the recognized value/result is completely unreasonable or far from the expected magnitude (e.g., should be 1.2 but shows 3457; or drastic unit/number mismatch). Moreover, if the "mistake" appears mid-derivation but the final answer matches the correct final answer, it is empir...
