MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
Pith reviewed 2026-05-18 04:17 UTC · model grok-4.3
The pith
MMTutorBench is the first benchmark built to test whether multimodal AI models can perform the core tasks of math tutoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that effective math tutoring requires diagnosing difficulties and guiding students step by step, and they introduce MMTutorBench as the first benchmark to measure this capability in multimodal models. The benchmark consists of 685 problems centered on key pedagogical steps, each accompanied by problem-specific rubrics for fine-grained scoring across six dimensions and organized into the three tasks of Insight Discovery, Operation Formulation, and Operation Execution. Initial evaluations of twelve models reveal performance gaps between proprietary and open-source systems, substantial distance from human tutor levels, and consistent effects from input variations such as a
What carries the argument
Problem-specific rubrics paired with each of the 685 problems that enable fine-grained scoring across six dimensions for the three tutoring tasks.
If this is right
- Proprietary multimodal models outperform open-source models on the three tutoring tasks.
- Feeding problems through OCR pipelines lowers tutoring quality compared with direct image input.
- Few-shot prompting produces only small gains on these tutoring tasks.
- Rubric-based evaluation by another large model matches human judgments reliably enough for practical use.
- Current AI systems have substantial room to improve before matching human tutor performance on insight discovery, operation formulation, and execution.
Where Pith is reading between the lines
- Developers could use the three-task breakdown to diagnose and fix specific weaknesses in tutoring models.
- The rubric approach might transfer to building similar benchmarks for tutoring in science or programming.
- Longitudinal testing on MMTutorBench could track whether new models are actually getting better at guiding students.
- Integrating the rubrics directly into training loops might produce AI tutors that better match the measured dimensions.
Load-bearing premise
The chosen 685 problems and their rubrics accurately represent the skills needed to tutor math in real classrooms.
What would settle it
An AI system that scores high on MMTutorBench yet provides ineffective help when actually tutoring real students over multiple sessions would show the benchmark misses key aspects of tutoring ability.
Original abstract
Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MMTutorBench, the first multimodal benchmark for AI math tutoring. It consists of 685 problems built around pedagogically significant key-steps, each paired with problem-specific rubrics for fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. The authors evaluate 12 leading MLLMs, report performance gaps between proprietary and open-source models, substantial room relative to human tutors, and trends on input variants including OCR degradation and limited few-shot gains; they also present their rubric-based LLM-as-a-Judge as highly reliable.
Significance. If the benchmark construction, rubric validation, and judge reliability hold, this work fills a notable gap by shifting evaluation from pure problem-solving to diagnostic tutoring skills in a multimodal setting. The problem-specific rubrics and three-task structure offer diagnostic granularity that could meaningfully advance research on AI educational agents, provided the 685 problems are representative.
major comments (2)
- [§3] §3 (Benchmark Construction): The description of problem sourcing, key-step identification, and inclusion/exclusion criteria is insufficient to support the central claim that the 685 problems enable fine-grained measurement of real-world tutoring skills; without details on expert validation, student data grounding, or inter-rater reliability for rubric creation, the diagnostic value remains only partially substantiated.
- [§5] §5 (Evaluation and LLM-as-a-Judge): The claim that the rubric-based LLM-as-a-Judge 'proves highly reliable' lacks reported agreement metrics, human correlation coefficients, or validation against multiple raters; this is load-bearing for all fine-grained results across the six dimensions and the three tasks.
minor comments (2)
- [Results] Table 1 or results section: Clarify the exact distribution of problems across the three tasks and any balancing for difficulty or topic to aid reproducibility.
- [Related Work] Related Work: Add explicit comparison to prior multimodal math benchmarks (e.g., those focused solely on solving) to sharpen the novelty claim.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide additional clarity and supporting details where the original description was insufficient.
- Referee: [§3] §3 (Benchmark Construction): The description of problem sourcing, key-step identification, and inclusion/exclusion criteria is insufficient to support the central claim that the 685 problems enable fine-grained measurement of real-world tutoring skills; without details on expert validation, student data grounding, or inter-rater reliability for rubric creation, the diagnostic value remains only partially substantiated.
  Authors: We agree that Section 3 would benefit from greater transparency. In the revised manuscript we have expanded the benchmark construction section to describe the sourcing of problems from standard K-12 mathematics curricula and publicly available educational resources, the criteria used to select pedagogically significant key-steps, and explicit inclusion/exclusion rules focused on problems that admit multimodal tutoring interactions. Rubrics were developed through iterative consultation with experienced math educators; we now report the inter-rater agreement statistics obtained during this process. The benchmark is deliberately grounded in general pedagogical principles rather than in logs from a specific student cohort, which we have clarified to avoid any implication of direct student-data grounding.
  revision: yes
- Referee: [§5] §5 (Evaluation and LLM-as-a-Judge): The claim that the rubric-based LLM-as-a-Judge 'proves highly reliable' lacks reported agreement metrics, human correlation coefficients, or validation against multiple raters; this is load-bearing for all fine-grained results across the six dimensions and the three tasks.
  Authors: We acknowledge that the original text relied on qualitative description. We have added quantitative validation in the revised Section 5 and a new appendix: Cohen’s kappa and Pearson correlation coefficients computed against multiple human raters on a stratified sample of 100 problem evaluations, together with per-dimension agreement figures. These metrics support the reliability claim for the six rubric dimensions and three tasks.
  revision: yes
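
The agreement metrics named in the rebuttal can be illustrated with a minimal sketch. It assumes each rubric item is scored as a categorical label (here 0 or 1) by the LLM judge and by one human rater, and that overall scores are numeric; the paper's actual scoring scale and tooling are not given in this excerpt, and the function names `cohen_kappa` and `pearson_r` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty set of items")
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1.0 - expected)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


# Illustrative labels only, not data from the paper.
judge = [1, 0, 1, 1, 0, 1]
human = [1, 0, 1, 0, 0, 1]
kappa = cohen_kappa(judge, human)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually reported alongside a correlation on the aggregate scores rather than instead of it.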
Circularity Check
No significant circularity; benchmark is independently constructed
The paper introduces MMTutorBench as a new collection of 685 problems with problem-specific rubrics and three defined tasks (Insight Discovery, Operation Formulation, Operation Execution). Evaluation is performed on external MLLMs with no fitted parameters, self-referential predictions, or derivations that reduce claims to inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to justify the benchmark's validity; the contribution is the creation and application of the benchmark itself, which remains self-contained against external models.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  "We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps... three tasks: Insight Discovery, Operation Formulation, and Operation Execution."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] A meta-analysis of the relation between math anxiety and math achievement. Psychological Bulletin, 147(2):134. Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit S. Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henr...
- [2] Exploring llms for predicting tutor strategy and student outcomes in dialogues. Preprint, arXiv:2507.06910. Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, ...
- [3] Mimo: Unlocking the reasoning potential of language model - from pretraining to posttraining. CoRR, abs/2505.07608. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2...
- [4] Automated Candidate Generation: We employ Gemini-2.5-Pro (Comanici et al., 2025), a powerful multimodal model, to analyze the full content of each video. Guided by a carefully crafted prompt, the model is instructed to identify pivotal steps in the problem-solving process.
- [5] Timestamp Pair Generation: For each pivotal moment, the model outputs a pair of precise timestamps (in HH:MM:SS format) and a brief justification. This pair consists of the timestamp for the critical step itself and the timestamp for the immediately preceding step. To ensure our benchmark contains a diverse set of problems, we limit the extraction to a ma...
- [6] Frame Pair Extraction: The generated timestamp pairs are then used with FFmpeg to extract the corresponding two static image frames for each identified moment.
- [7] Manual Verification and Curation: This is the most critical and labor-intensive phase of our data construction. Each extracted pair of candidate frames is subjected to a rigorous manual review by our annotators. This process involved assessing frames against strict criteria, including image clarity, pedagogical significance, and suitability for tutoring. P...
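
The frame-pair step in entries [5] and [6] could be scripted roughly as follows. This is a sketch under stated assumptions, not the authors' pipeline: the function names, the output file naming, and the PNG format are hypothetical. Only the HH:MM:SS timestamp format and the use of FFmpeg come from the excerpt, and the FFmpeg options used (`-y`, `-ss`, `-i`, `-frames:v`) are standard ones.

```python
import re
from pathlib import Path

# Zero-padded HH:MM:SS, the timestamp format named in the excerpt.
_TIMESTAMP = re.compile(r"^\d{2}:[0-5]\d:[0-5]\d$")


def _check_timestamp(timestamp):
    if not _TIMESTAMP.match(timestamp):
        raise ValueError(f"expected HH:MM:SS, got {timestamp!r}")


def extract_frame_cmd(video, timestamp, out_path):
    """Build an FFmpeg command that writes one frame at `timestamp`."""
    _check_timestamp(timestamp)
    return [
        "ffmpeg", "-y",        # overwrite an existing output file
        "-ss", timestamp,      # seek to the requested position
        "-i", str(video),
        "-frames:v", "1",      # stop after a single video frame
        str(out_path),
    ]


def frame_pair_cmds(video, previous_ts, critical_ts, out_dir, stem):
    """Commands for the (previous step, critical step) frame pair."""
    _check_timestamp(previous_ts)
    _check_timestamp(critical_ts)
    # Zero-padded HH:MM:SS strings sort chronologically as plain strings.
    if previous_ts >= critical_ts:
        raise ValueError("previous step must precede the critical step")
    out_dir = Path(out_dir)  # hypothetical output layout
    return [
        extract_frame_cmd(video, previous_ts, out_dir / f"{stem}_previous.png"),
        extract_frame_cmd(video, critical_ts, out_dir / f"{stem}_critical.png"),
    ]


# Illustrative values only; "lesson.mp4" is a placeholder file name.
commands = frame_pair_cmds("lesson.mp4", "00:03:10", "00:03:42", "frames", "v1_k0")
```

The sketch only builds the argument lists; each one could then be passed to `subprocess.run(cmd, check=True)` on a machine with FFmpeg installed.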
- [8] Critical Step: a pivotal stage in problem solving (e.g., equation transformation, formula introduction, concept explanation, etc.)
- [9] You must also capture the step just before the critical step, where the key equation is about to appear but has not yet appeared. The previous step and the critical step should be tightly connected; the only difference is whether the key equation is present.
- [10] Limit the total to at most 5 pairs; choose the most representative moments.
  Requirement
- [11] Clear Handwriting: The handwriting relevant to the step (equations, diagrams, etc.) is fully written and legible.
- [12] Peak Clarity: The handwriting is stable and complete, not mid-writing, blurry, or partially shown.
- [13] Audio-Visual Match: The narration clearly refers to the handwriting visible on screen.
  Exclude
  • Writing only the problem number/title.
  • Frames with blurry or incomplete handwriting.
  • Transition frames (e.g., board erasing).
  • Final boxed answers without explanation.
  Output Requirement
  Return a chronological list of timestamp dictionaries with concise jus...
- [14] Automated Scene Segmentation: The process begins by programmatically identifying significant visual shifts in the video. We compute the Structural Similarity Index (SSIM) (Wang et al., 2004) score between every pair of consecutive frames. A potential scene boundary is flagged wherever this score drops below a threshold. To ensure that only meaningful tra...
- [15] Representative Frame Extraction: Once the final set of scene boundaries is established, a single, clear representative frame is extracted from each resulting video segment. This transforms the video into an initial, condensed sequence of static images that summarizes the visual progression of the solution.
- [16] Manual Verification and Curation: Similar to our key-step frame extraction, this phase is crucial for data quality. Our annotation team meticulously reviews the automatically generated sequence of representative frames. Their task is to refine this sequence by removing redundant images and supplementing any missing frames to repair logical or visual dis-...
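
The thresholding idea in entry [14] can be sketched in a few lines. Two assumptions are worth flagging: this uses a simplified single-window (global) SSIM over whole frames rather than the locally windowed index of Wang et al. (2004), and the threshold of 0.6 is a hypothetical value, since the excerpt does not state the one used. Frames are represented as flat lists of grayscale intensities in the range 0 to 255.

```python
from statistics import fmean


def global_ssim(frame_a, frame_b, dynamic_range=255.0):
    """Single-window SSIM between two equal-size flat grayscale frames."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and equal in size")
    # Standard stabilising constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a, mu_b = fmean(frame_a), fmean(frame_b)
    var_a = fmean((x - mu_a) ** 2 for x in frame_a)
    var_b = fmean((y - mu_b) ** 2 for y in frame_b)
    cov = fmean((x - mu_a) * (y - mu_b) for x, y in zip(frame_a, frame_b))
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def scene_boundaries(frames, threshold=0.6):
    """Indices i where frame i differs enough from frame i-1 to start a scene.

    `threshold` is a hypothetical default, not a value from the paper.
    """
    return [
        i
        for i in range(1, len(frames))
        if global_ssim(frames[i - 1], frames[i]) < threshold
    ]


# Tiny 4-pixel "frames", purely illustrative.
board = [10.0, 20.0, 30.0, 40.0]
new_board = [200.0, 10.0, 220.0, 5.0]
cuts = scene_boundaries([board, board, new_board, new_board])
```

A production version would more likely decode frames with a video library and use a windowed SSIM implementation; the sketch is only meant to show where the threshold enters.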
- [17] Question-Answer Pair Extraction: Our process begins by analyzing the video transcripts. We first employ Gemini-2.0-Flash to process the subtitles and isolate the core mathematical problem-solving steps relevant to each key-step frame. Based on these extracted, concise solution steps, we then generate corresponding question-answer pairs. This output su...
- [18] Automated Rubric Generation: Based on each sample’s question-answer pair and its full set of contextual images, we use Gemini-2.5-Pro with a structured prompt to generate scoring criteria for four specific dimensions: Insight Discovery, Operation Formulation, Operation Execution, and Solution Scope Control. These criteria are then combined with the requirem...
- [19] Manual Verification and Curation: The auto-generated rubrics undergo a rigorous manual verification process to ensure their precision and fairness. Our annotators meticulously review each scoring criterion, performing corrections, additions, or deletions as needed. The primary task is to ensure that the rubric’s specific dimensions (Insight Discovery...
- [20] Student’s Question: Simulate a student who is trying to move forward in the problem but is confused about what to do next, based on the current step. The question should sound natural and authentic, using first-person language like "What should I do now?" or "How do I continue from here?" Notice: the question should NOT mention any specific mathematical...
- [21] • Student’s current point of confusion:
  Tutor’s Answer: Provide a precise and impersonal response following the format:
  • [key detail]: extract the key detail in the student’s current state of the solution including the image and text, and explain the rationale for paying attention to this key detail.
  • [key operation]: Based on the key detail, provide the very next critical operation the st...