MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

Jiaming Su; Meng Jiang; Mengzhao Jia; Sichen Guo; Tengchao Yang; Yuanyang Liu; Zhihan Zhang

arxiv: 2510.23477 · v2 · pith:A467SIL4new · submitted 2025-10-27 · 💻 cs.CL

Tengchao Yang, Sichen Guo, Mengzhao Jia, Jiaming Su, Yuanyang Liu, Zhihan Zhang, Meng Jiang

Pith reviewed 2026-05-18 04:17 UTC · model grok-4.3

classification 💻 cs.CL

keywords math tutoring benchmark, multimodal large language models, AI evaluation, pedagogical rubrics, insight discovery, operation formulation, operation execution, fine-grained assessment

The pith

MMTutorBench is the first benchmark built to test whether multimodal AI models can perform the core tasks of math tutoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMTutorBench to fill the gap in evaluating AI systems on tutoring rather than just solving math problems. It contains 685 problems organized around pedagogically important key steps, each with custom rubrics that score responses on six dimensions. The benchmark breaks tutoring into three concrete tasks: Insight Discovery, Operation Formulation, and Operation Execution. When twelve leading multimodal models were tested, proprietary systems outperformed open-source ones but both fell well short of human tutors, and results varied with input type such as OCR versus direct images. The work shows that rubric-based judging by another model works reliably for this kind of evaluation.

Core claim

The authors establish that effective math tutoring requires diagnosing difficulties and guiding students step by step, and they introduce MMTutorBench as the first benchmark to measure this capability in multimodal models. The benchmark consists of 685 problems centered on key pedagogical steps, each accompanied by problem-specific rubrics for fine-grained scoring across six dimensions and organized into the three tasks of Insight Discovery, Operation Formulation, and Operation Execution. Initial evaluations of twelve models reveal performance gaps between proprietary and open-source systems, substantial distance from human tutor levels, and consistent effects from input variations such as a

What carries the argument

Problem-specific rubrics paired with each of the 685 problems that enable fine-grained scoring across six dimensions for the three tutoring tasks.
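As a rough illustration of what a problem-specific rubric could look like as data, here is a minimal Python sketch. The six dimension names follow the rubric description quoted in the reference excerpts of this page; the `Rubric` class, the `score_response` helper, the sample criteria, and the fraction-of-criteria-met scoring rule are all assumptions made for illustration, not the authors' data format or scoring scheme.

```python
from dataclasses import dataclass

# Dimension names as listed in the paper's rubric description.
DIMENSIONS = (
    "Insight Discovery",
    "Operation Formulation",
    "Operation Execution",
    "Solution Scope Control",
    "Brevity",
    "Coherence",
)


@dataclass
class Rubric:
    """Hypothetical container: one list of criteria per dimension."""
    problem_id: str
    criteria: dict[str, list[str]]


def score_response(rubric: Rubric, met: dict[str, set[int]]) -> dict[str, float]:
    """Return the fraction of criteria met per dimension (assumed scoring rule).

    `met` maps a dimension name to the indices of the criteria that a judge
    marked as satisfied. Dimensions without criteria are skipped.
    """
    scores = {}
    for dim in DIMENSIONS:
        items = rubric.criteria.get(dim, [])
        if not items:
            continue
        satisfied = met.get(dim, set())
        hits = sum(1 for i in range(len(items)) if i in satisfied)
        scores[dim] = hits / len(items)
    return scores


example = Rubric(
    problem_id="demo-001",  # made-up identifier
    criteria={
        "Insight Discovery": ["Notices the sign error", "Explains why it matters"],
        "Brevity": ["Stays within a few sentences"],
    },
)
result = score_response(example, {"Insight Discovery": {0}, "Brevity": {0}})
```

Keeping per-dimension scores separate, rather than collapsing them into one number, is what would let the three-task breakdown be used diagnostically.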

If this is right

Proprietary multimodal models outperform open-source models on the three tutoring tasks.
Feeding problems through OCR pipelines lowers tutoring quality compared with direct image input.
Few-shot prompting produces only small gains on these tutoring tasks.
Rubric-based evaluation by another large model matches human judgments reliably enough for practical use.
Current AI systems have substantial room to improve before matching human tutor performance on insight discovery, operation formulation, and execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could use the three-task breakdown to diagnose and fix specific weaknesses in tutoring models.
The rubric approach might transfer to building similar benchmarks for tutoring in science or programming.
Longitudinal testing on MMTutorBench could track whether new models are actually getting better at guiding students.
Integrating the rubrics directly into training loops might produce AI tutors that better match the measured dimensions.

Load-bearing premise

The chosen 685 problems and their rubrics accurately represent the skills needed to tutor math in real classrooms.

What would settle it

An AI system that scores high on MMTutorBench yet provides ineffective help when actually tutoring real students over multiple sessions would show the benchmark misses key aspects of tutoring ability.

Figures

Figures reproduced from arXiv: 2510.23477 by Jiaming Su, Meng Jiang, Mengzhao Jia, Sichen Guo, Tengchao Yang, Yuanyang Liu, Zhihan Zhang.

**Figure 1.** (a) Existing benchmarks usually target a single perspective, such as handwritten expression recognition or problem solving, which is insufficient for evaluating tutoring ability in real educational settings. (b) An example from MMTutorBench: we model the tutoring process in realistic classroom scenarios by taking a student’s handwritten solution attempt and help-seeking question as input. The tutoring resp…

**Figure 2.** The data curation pipeline of MMTutorBench. We start by collecting problems including both images and…

**Figure 3.** Performance comparison of various models

**Figure 4.** Error distribution for the top-performing…

**Figure 5.** Inter-judge reliability of our evaluation rubric.

**Figure 6.** Distribution of mathematical domains in our…

**Figure 7.** An example of input images, student question and reference answer.

**Figure 8.** An evaluation rubric comparing a high-scoring response from Gemini-2.5-Pro with a low-scoring response…

Abstract

Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MMTutorBench introduces a tutoring-specific multimodal math benchmark with 685 problems and custom rubrics, but its claims rest on construction details that are not yet visible.

The main thing to know is that this paper puts forward MMTutorBench as the first benchmark aimed at AI math tutoring rather than just problem solving. It uses 685 multimodal problems organized around pedagogically key steps, split into three tasks (Insight Discovery, Operation Formulation, and Operation Execution) and scored with problem-specific rubrics across six dimensions. The authors run it on 12 MLLMs and report gaps between proprietary and open models, limited gains from few-shot prompting, degradation from OCR, and some distance from human tutor performance. They also claim that their rubric-based LLM judge is reliable.

Referee Report

2 major / 2 minor

Summary. The paper introduces MMTutorBench, the first multimodal benchmark for AI math tutoring. It consists of 685 problems built around pedagogically significant key-steps, each paired with problem-specific rubrics for fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. The authors evaluate 12 leading MLLMs, report performance gaps between proprietary and open-source models, substantial room relative to human tutors, and trends on input variants including OCR degradation and limited few-shot gains; they also present their rubric-based LLM-as-a-Judge as highly reliable.

Significance. If the benchmark construction, rubric validation, and judge reliability hold, this work fills a notable gap by shifting evaluation from pure problem-solving to diagnostic tutoring skills in a multimodal setting. The problem-specific rubrics and three-task structure offer diagnostic granularity that could meaningfully advance research on AI educational agents, provided the 685 problems are representative.

major comments (2)

§3 (Benchmark Construction): The description of problem sourcing, key-step identification, and inclusion/exclusion criteria is insufficient to support the central claim that the 685 problems enable fine-grained measurement of real-world tutoring skills; without details on expert validation, student data grounding, or inter-rater reliability for rubric creation, the diagnostic value remains only partially substantiated.
§5 (Evaluation and LLM-as-a-Judge): The claim that the rubric-based LLM-as-a-Judge 'proves highly reliable' lacks reported agreement metrics, human correlation coefficients, or validation against multiple raters; this is load-bearing for all fine-grained results across the six dimensions and the three tasks.

minor comments (2)

Table 1 or results section: Clarify the exact distribution of problems across the three tasks and any balancing for difficulty or topic to aid reproducibility.
Related Work: Add explicit comparison to prior multimodal math benchmarks (e.g., those focused solely on solving) to sharpen the novelty claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide additional clarity and supporting details where the original description was insufficient.

Referee: §3 (Benchmark Construction): The description of problem sourcing, key-step identification, and inclusion/exclusion criteria is insufficient to support the central claim that the 685 problems enable fine-grained measurement of real-world tutoring skills; without details on expert validation, student data grounding, or inter-rater reliability for rubric creation, the diagnostic value remains only partially substantiated.

Authors: We agree that Section 3 would benefit from greater transparency. In the revised manuscript we have expanded the benchmark construction section to describe the sourcing of problems from standard K-12 mathematics curricula and publicly available educational resources, the criteria used to select pedagogically significant key-steps, and explicit inclusion/exclusion rules focused on problems that admit multimodal tutoring interactions. Rubrics were developed through iterative consultation with experienced math educators; we now report the inter-rater agreement statistics obtained during this process. The benchmark is deliberately grounded in general pedagogical principles rather than in logs from a specific student cohort, which we have clarified to avoid any implication of direct student-data grounding. Revision: yes.

Referee: §5 (Evaluation and LLM-as-a-Judge): The claim that the rubric-based LLM-as-a-Judge 'proves highly reliable' lacks reported agreement metrics, human correlation coefficients, or validation against multiple raters; this is load-bearing for all fine-grained results across the six dimensions and the three tasks.

Authors: We acknowledge that the original text relied on qualitative description. We have added quantitative validation in the revised Section 5 and a new appendix: Cohen’s kappa and Pearson correlation coefficients computed against multiple human raters on a stratified sample of 100 problem evaluations, together with per-dimension agreement figures. These metrics support the reliability claim for the six rubric dimensions and three tasks. Revision: yes.
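For readers unfamiliar with the two agreement statistics named in this response, the sketch below computes Cohen's kappa (chance-corrected agreement on categorical labels) and the Pearson correlation (linear association between numeric scores) in plain Python. The function names and the toy judge/human labels are invented for illustration; nothing here reproduces the paper's data or reported values.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Toy data (made up): 1 = criterion met, 0 = not met.
judge_labels = [1, 0, 1, 1, 0, 1]
human_labels = [1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(judge_labels, human_labels)

# Toy numeric scores (made up) for the correlation.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

Kappa suits per-criterion pass/fail judgments, while Pearson suits aggregated numeric scores; reporting both per dimension would address the referee's concern more directly than a single pooled figure.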

Circularity Check

0 steps flagged

No significant circularity; benchmark is independently constructed

The paper introduces MMTutorBench as a new collection of 685 problems with problem-specific rubrics and three defined tasks (Insight Discovery, Operation Formulation, Operation Execution). Evaluation is performed on external MLLMs with no fitted parameters, self-referential predictions, or derivations that reduce claims to inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to justify the benchmark's validity; the contribution is the creation and application of the benchmark itself, which remains self-contained against external models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations, free parameters, or invented physical entities. Problem selection and rubric design choices are implicit but not quantified as fitted values.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · relation: unclear

We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps... three tasks—Insight Discovery, Operation Formulation, and Operation Execution.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[1]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

A meta-analysis of the relation between math anxiety and math achievement. Psychological Bulletin, 147(2):134. Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit S. Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henr...

arXiv 2025
[2]

Exploring LLMs for predicting tutor strategy and student outcomes in dialogues. Preprint, arXiv:2507.06910. Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, ...

arXiv 2025
[3]

Mimo: Unlocking the reasoning potential of language model - from pretraining to posttraining

Mimo: Unlocking the reasoning potential of language model - from pretraining to posttraining. CoRR, abs/2505.07608. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2...

arXiv 2025
[4]

Guided by a carefully crafted prompt, the model is instructed to identify pivotal steps in the problem-solving process

Automated Candidate Generation: We employ Gemini-2.5-Pro (Comanici et al., 2025), a powerful multimodal model, to analyze the full content of each video. Guided by a carefully crafted prompt, the model is instructed to identify pivotal steps in the problem-solving process

[5]

This pair consists of the timestamp for the critical step itself and the timestamp for the immediately preceding step

Timestamp Pair Generation: For each pivotal moment, the model outputs a pair of precise timestamps (in HH:MM:SS format) and a brief justification. This pair consists of the timestamp for the critical step itself and the timestamp for the immediately preceding step. To ensure our benchmark contains a diverse set of problems, we limit the extraction to a ma...

[6]

Frame Pair Extraction: The generated timestamp pairs are then used with FFmpeg to extract the corresponding two static image frames for each identified moment

[7]

Each extracted pair of candidate frames is subjected to a rigorous manual review by our annotators

Manual Verification and Curation: This is the most critical and labor-intensive phase of our data construction. Each extracted pair of candidate frames is subjected to a rigorous manual review by our annotators. This process involved assessing frames against strict criteria, including image clarity, pedagogical significance, and suitability for tutoring. P...

[8]

Critical Step: a pivotal stage in problem solving (e.g., equation transformation, formula introduction, concept explanation, etc.)

[9]

The previous step and the critical step should be tightly connected; the only difference is whether the key equation is present

You must also capture the step just before the critical step, where the key equation is about to appear but has not yet appeared. The previous step and the critical step should be tightly connected; the only difference is whether the key equation is present

[10]

Requirement

Limit the total to at most 5 pairs; choose the most representative moments. Requirement

[11]

Clear Handwriting: The handwriting relevant to the step (equations, diagrams, etc.) is fully written and legible

[12]

Peak Clarity: The handwriting is stable and complete, not mid-writing, blurry, or partially shown

[13]

Audio-Visual Match: The narration clearly refers to the handwriting visible on screen.

Exclude
• Writing only the problem number/title.
• Frames with blurry or incomplete handwriting.
• Transition frames (e.g., board erasing).
• Final boxed answers without explanation.

Output Requirement: Return a chronological list of timestamp dictionaries with concise jus...

[14]

We compute the Structural Similarity Index (SSIM) (Wang et al., 2004) score between every pair of consecutive frames

Automated Scene Segmentation: The process begins by programmatically identifying significant visual shifts in the video. We compute the Structural Similarity Index (SSIM) (Wang et al., 2004) score between every pair of consecutive frames. A potential scene boundary is flagged wherever this score drops below a threshold. To ensure that only meaningful tra...

[15]

This transforms the video into an initial, condensed sequence of static images that summarizes the visual progression of the solution

Representative Frame Extraction: Once the final set of scene boundaries is established, a single, clear representative frame is extracted from each resulting video segment. This transforms the video into an initial, condensed sequence of static images that summarizes the visual progression of the solution

[16]

Our annotation team meticulously reviews the automatically generated sequence of representative frames

Manual Verification and Curation: Similar to our key-step frame extraction, this phase is crucial for data quality. Our annotation team meticulously reviews the automatically generated sequence of representative frames. Their task is to refine this sequence by removing redundant images and supplementing any missing frames to repair logical or visual dis-...

[17]

We first employ Gemini-2.0-Flash to process the subtitles and isolate the core mathematical problem-solving steps relevant to each key-step frame

Question-Answer Pair Extraction: Our process begins by analyzing the video transcripts. We first employ Gemini-2.0-Flash to process the subtitles and isolate the core mathematical problem-solving steps relevant to each key-step frame. Based on these extracted, concise solution steps, we then generate corresponding question-answer pairs. This output su...

[18]

These criteria are then combined with the requirements for two general dimensions (Brevity and Coherence) to create a preliminary six-dimensional rubric

Automated Rubric Generation: Based on each sample’s question-answer pair and its full set of contextual images, we use Gemini-2.5-Pro with a structured prompt to generate scoring criteria for four specific dimensions: Insight Discovery, Operation Formulation, Operation Execution, and Solution Scope Control. These criteria are then combined with the requirem...

[19]

Our annotators meticulously review each scoring criterion, performing corrections, additions, or deletions as needed

Manual Verification and Curation: The auto-generated rubrics undergo a rigorous manual verification process to ensure their precision and fairness. Our annotators meticulously review each scoring criterion, performing corrections, additions, or deletions as needed. The primary task is to ensure that the rubric’s specific dimensions (Insight Discovery...

[20]

What should I do now?

Student’s Question: Simulate a student who is trying to move forward in the problem but is confused about what to do next, based on the current step. The question should sound natural and authentic, using first-person language like "What should I do now?" or "How do I continue from here?" Notice: the question should NOT mention any specific mathematical...

[21]

• Student’s current point of confusion:

Tutor’s Answer: Provide a precise and impersonal response following the format:
• [key detail]: extract the key detail in the student’s current state of the solution including the image and text, and explain the rationale for paying attention to this key detail.
• [key operation]: Based on the key detail, provide the very next critical operation the st...

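The automated scene segmentation step quoted in entry [14] (flag a boundary wherever the SSIM between consecutive frames drops below a threshold) can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it computes a single global SSIM over the whole frame, whereas SSIM as usually applied averages over local windows; frames are assumed to be flat lists of grayscale values in [0, 255]; and the `threshold=0.8` default and both function names are hypothetical. Any further filtering of boundaries that the paper applies is not reproduced here.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equal-length grayscale pixel lists.

    Simplification: one window covering the whole frame, no local averaging.
    """
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and of equal size")
    n = len(x)
    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def scene_boundaries(frames, threshold=0.8):
    """Return indices i where SSIM(frames[i-1], frames[i]) falls below threshold."""
    return [
        i
        for i in range(1, len(frames))
        if global_ssim(frames[i - 1], frames[i]) < threshold
    ]


# Toy 4x4 frames (made up): two dark frames followed by two bright frames.
dark = [10.0] * 16
bright = [200.0] * 16
boundaries = scene_boundaries([dark, dark, bright, bright])
```

In this toy input the only flagged index is the first bright frame, since identical frames score 1 and the dark-to-bright jump scores far below the assumed threshold. A real pipeline would decode frames from video and would more likely rely on an established SSIM implementation than on this hand-rolled version.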
