EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions
Pith reviewed 2026-05-16 11:26 UTC · model grok-4.3
The pith
MLLMs exhibit widespread latent failures when recognizing and grading authentic university handwritten STEM solutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The EDU-CIRCUIT-HW dataset enables simultaneous assessment of MLLM recognition fidelity and auto-grading performance on unconstrained handwritten STEM solutions using expert-verified ground truth. This evaluation uncovers an astonishing scale of latent failures within the recognized content, demonstrating that MLLMs lack sufficient reliability for applications like auto-grading in high-stakes educational settings.
What carries the argument
The EDU-CIRCUIT-HW dataset of 1,300+ authentic handwritten solutions paired with expert-verified verbatim transcriptions and grading reports, which tests upstream recognition accuracy and downstream task performance in parallel.
If this is right
- Current MLLMs are insufficiently reliable for auto-grading in high-stakes educational settings.
- Identified error patterns can be leveraged to detect and correct recognition errors preemptively.
- A hybrid system routing only 3.3% of assignments to human graders while using GPT-5.1 for the rest enhances overall robustness.
- Authentic handwritten benchmarks are required to properly evaluate MLLMs for educational understanding tasks.
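The hybrid routing idea in the list above can be illustrated with a minimal Python sketch. It assumes each submission already carries a risk score for grade-changing recognition errors; that score, the `threshold` parameter, and the function names are hypothetical and are not taken from the paper, whose actual routing criterion is based on identified error patterns.

```python
from typing import Dict, List, Tuple


def route_submissions(
    risk_scores: Dict[str, float], threshold: float
) -> Tuple[List[str], List[str]]:
    """Split submission IDs into (human_queue, auto_queue).

    risk_scores maps a submission ID to a hypothetical score in [0, 1]
    estimating how likely the MLLM transcription contains a recognition
    error. The score is an assumption of this sketch, not a quantity
    defined in the paper.
    """
    # Submissions at or above the threshold go to a human grader.
    human_queue = sorted(s for s, r in risk_scores.items() if r >= threshold)
    # Everything else is left to the automated grader.
    auto_queue = sorted(s for s, r in risk_scores.items() if r < threshold)
    return human_queue, auto_queue


def human_fraction(risk_scores: Dict[str, float], threshold: float) -> float:
    """Return the share of submissions routed to human graders."""
    if not risk_scores:
        return 0.0
    human_queue, _ = route_submissions(risk_scores, threshold)
    return len(human_queue) / len(risk_scores)
```

In a deployment, the threshold would be tuned on held-out data so that the human share stays small (the paper reports 3.3%) while catching most grade-changing errors.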
Where Pith is reading between the lines
- Models may benefit from additional training on handwritten mathematical expressions and diagrams common in STEM.
- The error-pattern detection approach could be adapted to evaluate MLLMs in other domains involving complex visual reasoning.
- Hybrid AI-human systems might scale to other high-stakes assessment settings beyond this single course.
Load-bearing premise
The expert-verified verbatim transcriptions and grading reports serve as accurate ground truth that fully captures the students' intended logic and content in unconstrained handwritten solutions.
What would settle it
An independent expert re-transcription of the handwritten solutions showing that MLLM outputs match student intent at rates comparable to the provided ground truth would falsify the scale of latent failures.
Original abstract
Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a potential solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and correct recognition errors, while requiring only minimal human intervention (e.g., routing 3.3% of assignments to human graders and the remainder to the GPT-5.1 grader), can effectively enhance the robustness of the deployed AI-enabled grading system. Code and dataset are available in this GitHub repo: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the EDU-CIRCUIT-HW dataset comprising 1,300+ authentic university-level STEM student handwritten solutions. It evaluates multiple MLLMs on upstream recognition fidelity by comparing outputs against expert-verified verbatim transcriptions and on downstream auto-grading performance. The central claim is that MLLMs exhibit an astonishing scale of latent failures in interpreting unconstrained handwritten content with intertwined formulas, diagrams, and reasoning, rendering them insufficiently reliable for high-stakes educational uses such as auto-grading. A case study demonstrates that routing only 3.3% of assignments to human graders based on detected error patterns can improve overall system robustness, with the dataset and code released publicly.
Significance. If the empirical results hold after addressing ground-truth validation, the work is significant as it supplies a realistic, domain-specific benchmark for MLLM handling of real student handwriting in STEM, an area where existing evaluations are limited. The dual upstream/downstream evaluation framework and the practical hybrid grading case study provide actionable insights for educational AI deployment. Public release of the dataset and code supports reproducibility and future work on error mitigation.
major comments (2)
- [Dataset section] The evaluation treats single-expert verbatim transcriptions and grading reports as unambiguous ground truth. In unconstrained university STEM handwriting, intertwined equations, sketches, and partial reasoning steps frequently admit multiple valid parsings; without reported inter-annotator agreement or multi-expert validation, apparent MLLM recognition failures may partly reflect transcription variance rather than model misunderstanding. This assumption is load-bearing for both the fidelity metrics and the downstream auto-grading claims.
- [Evaluation and Results sections] The headline claim of an 'astonishing scale of latent failures' and 'insufficient reliability' requires concrete quantitative support, including per-model error rates, error distributions across the 1,300+ solutions, number of samples tested per model, and any statistical tests. The abstract-only view provides no such details, which limits verification of the scale and weakens the reliability conclusion.
minor comments (1)
- [Abstract] The reference to 'GPT-5.1' should be clarified (e.g., whether it is a hypothetical or specific model version) to avoid confusion with current models.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the EDU-CIRCUIT-HW dataset and evaluation. We address each major comment below and describe the revisions planned for the next version of the manuscript.
- Referee: [Dataset section] The evaluation treats single-expert verbatim transcriptions and grading reports as unambiguous ground truth. In unconstrained university STEM handwriting, intertwined equations, sketches, and partial reasoning steps frequently admit multiple valid parsings; without reported inter-annotator agreement or multi-expert validation, apparent MLLM recognition failures may partly reflect transcription variance rather than model misunderstanding. This assumption is load-bearing for both the fidelity metrics and the downstream auto-grading claims.
Authors: We agree that inter-annotator agreement (IAA) metrics would strengthen the ground-truth validation. The original transcriptions and grades were produced by a single expert instructor with more than a decade of experience teaching the specific course, who verified each item against the handwritten originals and official rubrics. To address the concern, we will add a new subsection reporting IAA on a randomly sampled subset of 200 solutions transcribed and graded independently by a second expert. We will report word-level agreement (e.g., BLEU and exact-match) for transcriptions and percentage agreement plus Cohen's kappa for grades. These results will be included in the revised Dataset section.
Revision: yes
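For readers unfamiliar with the grade-agreement statistics named in the response, the following is a minimal Python sketch of percentage agreement and Cohen's kappa for two raters. The function names and the toy labels are illustrative assumptions; the authors' actual computation is not described here.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two raters give the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement corrected for chance agreement."""
    observed = percent_agreement(rater_a, rater_b)
    n = len(rater_a)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their own marginal rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa is the more conservative of the two figures because it discounts agreement that would occur by chance when one grade dominates.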
- Referee: [Evaluation and Results sections] The headline claim of an 'astonishing scale of latent failures' and 'insufficient reliability' requires concrete quantitative support, including per-model error rates, error distributions across the 1,300+ solutions, number of samples tested per model, and any statistical tests. The abstract-only view provides no such details, limiting verification of the scale and undermining the strength of the reliability conclusion.
Authors: The full manuscript already reports per-model character and semantic error rates, error-type distributions across all 1,300+ solutions, and the fact that every model was evaluated on the complete set. We will revise the Results section to include an explicit summary table of these metrics, add statistical significance tests (paired Wilcoxon signed-rank tests) comparing model performances, and update the abstract to feature the key quantitative figures (e.g., average recognition error rates and the 3.3% routing threshold).
Revision: partial
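As an illustration of what a character error rate measures, here is a minimal Python sketch that uses Levenshtein edit distance normalized by the reference length. This is one common definition and is an assumption here; the manuscript's exact metric, any text normalization, and its semantic error rate are not specified in this review.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character edits turning reference into hypothesis."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the ground-truth transcription."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

A single misread digit, such as `R1 = 2` recognized as `R1 = 3`, yields a low character error rate even though it can change a grade, which is why a character-level metric is typically reported alongside a semantic one.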
Circularity Check
No circularity: pure empirical evaluation on external dataset
The paper releases a new dataset of 1,300+ real student handwritten STEM solutions and performs direct empirical evaluation of MLLM recognition and auto-grading against expert-verified verbatim transcriptions plus grading reports. No derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. Claims rest on straightforward comparison to independent ground truth rather than any reduction of outputs to inputs by construction. Self-citations are absent from the load-bearing evaluation steps, and the case study on error correction uses identified patterns without re-fitting or re-deriving the core metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Expert-verified verbatim transcriptions accurately represent the handwritten content and student reasoning.