Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study
Pith reviewed 2026-05-20 19:30 UTC · model grok-4.3
The pith
Vision language models differ measurably in adaptivity but struggle to consistently tailor mathematics instruction to different learner profiles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a rubric derived from learner-modeling principles, the study shows that vision language models vary in how well their instructional responses in mathematics education adapt to learner profiles, and that current models do not adapt reliably, especially when learner information is limited.
What carries the argument
The learner model-based rubric, which breaks the evaluation of adaptivity in VLM-generated mathematics instruction into cognitive aspects, motivational aspects, and complexity.
If this is right
- VLMs will need enhancements to handle sparse learner data while still adapting effectively.
- Model selection for educational applications should consider measured adaptivity differences.
- Rubric-based evaluations can guide the development of more responsive tutoring AI.
- Consistent production of adaptive responses could make VLMs more reliable math learning aids.
Where Pith is reading between the lines
- Integrating more structured learner tracking into VLM interactions might improve their educational effectiveness.
- Extending the rubric to other subjects could reveal if adaptivity issues are math-specific or general.
- Hybrid approaches combining VLMs with traditional adaptive systems may address current limitations.
- Testing with real student interactions could validate if rubric scores predict actual learning gains.
Load-bearing premise
That the three-aspect breakdown of adaptivity adequately captures the key ways instruction should vary for different mathematics learners.
What would settle it
A direct comparison where the same math problem is posed to a VLM with two different learner profiles provided, checking if the generated explanations, difficulty, and encouragement levels adjust as expected for each profile.
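The comparison proposed above can be sketched as a small harness. This is a hypothetical illustration, not the paper's protocol: `LearnerProfile`, `build_prompt`, `surface_features`, `paired_comparison`, and `stub_generate` are names introduced here; the prompt wording mirrors the Grade 8 low-performance prompt quoted in the extracted appendix material further down this page; and the surface features (word count, numbered steps, encouragement cues) are crude proxies, not the rubric itself. A real run would pass an actual VLM call as `generate`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LearnerProfile:
    """Hypothetical learner profile; fields follow the Grade 8 prompt quoted on this page."""
    grade: int
    likes_math: bool
    confident: bool
    mastered_topics: tuple[str, ...]
    timss_score: int


def build_prompt(p: LearnerProfile) -> str:
    """Render a profile as a first-person tutoring request (template is an assumption)."""
    attitude = "I like learning mathematics" if p.likes_math else "I don't like learning mathematics"
    confidence = "I am confident" if p.confident else "I am not confident"
    topics = " and ".join(p.mastered_topics)
    return (
        f"I am a student from Grade {p.grade}, {attitude} and {confidence} in mathematics. "
        f"Now I have mastered mathematics topics such as {topics} in grade {p.grade}. "
        f"I got {p.timss_score} in the TIMSS 2019 Math Test. "
        "Can you teach me this math question?"
    )


# Assumed encouragement cues; a real study would apply the rubric's motivational criteria.
ENCOURAGEMENT_CUES = ("don't worry", "you can do", "great job", "well done")


def surface_features(text: str) -> dict[str, int]:
    """Crude proxies: length, lines that start with a digit (numbered steps), encouragement cues."""
    lower = text.lower()
    return {
        "words": len(text.split()),
        "steps": sum(1 for line in text.splitlines() if line.strip()[:1].isdigit()),
        "encouragement": sum(lower.count(cue) for cue in ENCOURAGEMENT_CUES),
    }


def paired_comparison(
    question: str,
    high: LearnerProfile,
    low: LearnerProfile,
    generate: Callable[[str], str],
) -> dict[str, int]:
    """Pose the same question under two profiles; positive deltas mean more for the low profile."""
    f_high = surface_features(generate(build_prompt(high) + "\n" + question))
    f_low = surface_features(generate(build_prompt(low) + "\n" + question))
    return {key: f_low[key] - f_high[key] for key in f_high}


def stub_generate(prompt: str) -> str:
    """Stand-in for a VLM call: scaffolds more when the prompt signals low confidence."""
    if "not confident" in prompt:
        return "Don't worry, let's go slowly.\n1. First step\n2. Second step\n3. Third step"
    return "1. Answer directly"


if __name__ == "__main__":
    low = LearnerProfile(8, False, False, ("number", "geometry"), 390)
    high = LearnerProfile(8, True, True, ("algebra", "geometry"), 620)  # hypothetical values
    print(paired_comparison("Which is the value of y when a = 1 and b = 2?", high, low, stub_generate))
```

With the stub, the script prints `{'words': 11, 'steps': 2, 'encouragement': 1}`, that is, the low-profile response is longer, has more numbered steps, and contains more encouragement. Such deltas are only a sanity check on direction; judging whether the adjustment is pedagogically appropriate still requires the rubric.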
Original abstract
Adaptive learning refers to educational technologies that track learners' progress and adapt the instructional process to each learner's performance. It is increasingly recognized as critical for developing effective learning support tools. Vision language models (VLMs) have been adopted in mathematics education, and students use them as learning aids for personalized instruction. However, it is unknown whether VLMs can adapt their mathematical instruction to different learner profiles, and there is no systematic framework for evaluating this adaptivity in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when given limited learner information.
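One way to record and aggregate scores under this rubric is sketched below. The abstract names the dimensions but not the scoring scale, so the 1–5 integer scale, the unweighted averaging of the three adaptivity aspects, and the names `RubricScore` and `summarize` are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean

ADAPTIVITY_ASPECTS = ("cognitive", "motivational", "complexity")


@dataclass(frozen=True)
class RubricScore:
    """Scores for one VLM response (1-5 scale assumed; the abstract does not specify one)."""
    cognitive: int
    motivational: int
    complexity: int
    answer_correct: bool
    solution_correct: bool
    quality: int

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1-5 range.
        for name in (*ADAPTIVITY_ASPECTS, "quality"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

    def adaptivity(self) -> float:
        """Unweighted mean of the three learner-model aspects."""
        return float(mean(getattr(self, aspect) for aspect in ADAPTIVITY_ASPECTS))


def summarize(scores_by_model: dict[str, list[RubricScore]]) -> dict[str, dict[str, float]]:
    """Per-model averages of adaptivity, correctness, and quality."""
    return {
        model: {
            "adaptivity": round(mean(s.adaptivity() for s in scores), 2),
            "answer_accuracy": round(mean(float(s.answer_correct) for s in scores), 2),
            "solution_accuracy": round(mean(float(s.solution_correct) for s in scores), 2),
            "quality": round(mean(s.quality for s in scores), 2),
        }
        for model, scores in scores_by_model.items()
    }


if __name__ == "__main__":
    demo = {"model_a": [RubricScore(4, 3, 5, True, True, 4), RubricScore(2, 2, 2, False, True, 3)]}
    print(summarize(demo))
```

Keeping correctness and quality separate from the adaptivity average mirrors the abstract's framing of them as additional dimensions rather than parts of adaptivity.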
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a learner model-based rubric, drawing on Shute and Towle (2018), to evaluate whether vision-language models can adapt mathematical instructional responses to different learner profiles. The rubric operationalizes adaptivity along cognitive, motivational, and complexity dimensions (plus correctness and quality), and the experiments report measurable differences across VLMs together with consistent struggles to produce appropriately adapted responses when learner information is limited.
Significance. If the rubric provides a valid operationalization, the work supplies an initial empirical benchmark for VLM adaptivity in mathematics education and identifies concrete limitations that could guide future model development or fine-tuning for personalized tutoring. The study also demonstrates a practical way to apply established learner-model concepts to static VLM outputs.
major comments (2)
- [Methods / Rubric definition] The learner model of Shute and Towle (2018) is a dynamic, multi-turn framework that tracks and updates knowledge, motivation, and task complexity across interactions. The manuscript applies the same three dimensions to single-turn VLM responses conditioned on varying amounts of learner information. No explicit bridging argument or validation is provided showing that one-shot rubric scores faithfully reflect the intended adaptive-tracking mechanism; low scores could therefore arise from prompt sensitivity or rubric mismatch rather than an intrinsic VLM limitation.
- [Experiments / Results] The central claim that VLMs 'struggle to consistently produce learner model-based instructional responses' rests on the rubric scores. Without reported inter-rater reliability, statistical tests for the observed differences, or ablation on prompt phrasing, it is difficult to determine whether the measurable differences are robust or driven by the particular prompt templates and limited-information conditions used in the evaluation.
minor comments (2)
- [Introduction] The abstract and introduction cite Shute and Towle (2018) but do not discuss how the original dynamic model was adapted for static evaluation; a short paragraph clarifying this mapping would improve transparency.
- [Results] Figure or table captions should explicitly state the number of models, prompts, and raters so readers can assess the scale of the reported differences without consulting the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and rigor of our evaluation framework. We address each major point below and describe the changes we will make in the revised manuscript.
-
Referee: [Methods / Rubric definition] The learner model of Shute and Towle (2018) is a dynamic, multi-turn framework that tracks and updates knowledge, motivation, and task complexity across interactions. The manuscript applies the same three dimensions to single-turn VLM responses conditioned on varying amounts of learner information. No explicit bridging argument or validation is provided showing that one-shot rubric scores faithfully reflect the intended adaptive-tracking mechanism; low scores could therefore arise from prompt sensitivity or rubric mismatch rather than an intrinsic VLM limitation.
Authors: We agree that the Shute and Towle model is designed for dynamic, multi-turn tracking. Our work intentionally evaluates single-turn VLM outputs as an initial benchmark for how well current models incorporate provided learner-profile information in typical one-shot tutoring queries. In the revision we will add a new subsection under Methods that explicitly bridges the dynamic framework to our static operationalization, including the rationale for scoring adaptation based on the information supplied in a single prompt. We will also add a short expert-validation paragraph (two domain experts re-scoring a subset of responses) and a limitations paragraph acknowledging that full multi-turn validation remains future work. revision: partial
-
Referee: [Experiments / Results] The central claim that VLMs 'struggle to consistently produce learner model-based instructional responses' rests on the rubric scores. Without reported inter-rater reliability, statistical tests for the observed differences, or ablation on prompt phrasing, it is difficult to determine whether the measurable differences are robust or driven by the particular prompt templates and limited-information conditions used in the evaluation.
Authors: We accept that the current presentation lacks these quantitative safeguards. In the revised version we will (1) report inter-rater reliability via Cohen’s kappa on the rubric annotations, (2) add appropriate statistical tests (ANOVA with post-hoc corrections) for model and information-level differences, and (3) include a prompt-ablation experiment that varies phrasing while keeping learner information constant. These additions will be placed in a new “Robustness Analyses” subsection of the Experiments section and will directly support the robustness of the reported differences. revision: yes
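The robustness checks promised above rest on standard formulas. The sketch below implements Cohen's kappa, the one-way ANOVA F statistic, and a Bonferroni adjustment in plain Python. Bonferroni is only one possible post-hoc correction and is chosen here for illustration, since the rebuttal does not name one; p-values for F need the F distribution, which `scipy.stats.f_oneway` provides, and `sklearn.metrics.cohen_kappa_score` is a library equivalent for kappa.

```python
import math
from collections import Counter
from collections.abc import Hashable, Sequence
from statistics import mean


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater labelled independently at their own base rates.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return math.nan  # undefined when both raters use one identical label throughout
    return (observed - expected) / (1.0 - expected)


def one_way_anova_f(*groups: Sequence[float]) -> float:
    """F = (between-group mean square) / (within-group mean square)."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    values = [x for g in groups for x in g]
    grand_mean = mean(values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = len(groups) - 1, len(values) - len(groups)
    if df_within == 0 or ss_within == 0:
        raise ValueError("F is undefined without within-group variation")
    return (ss_between / df_between) / (ss_within / df_within)


def bonferroni(p_values: Sequence[float]) -> list[float]:
    """Scale each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For example, two raters who label five responses 1, 1, 2, 2, 3 and 1, 2, 2, 2, 3 agree on 80% of items, but chance agreement is 0.36, so kappa is (0.8 − 0.36) / (1 − 0.36) = 0.6875.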
Circularity Check
No circularity: empirical rubric evaluation is self-contained
The paper draws on the external Shute and Towle (2018) learner model to define a three-aspect rubric (cognitive, motivational, complexity) plus correctness and quality dimensions, then reports experimental results on VLM responses to static prompts with varying learner information. No equations, fitted parameters, or self-referential definitions appear; the strongest claims rest on observed differences in rubric scores rather than any reduction of outputs to the paper's own inputs by construction. The cited framework is treated as an independent basis for operationalization, with results presented as direct measurements from the evaluation protocol.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The learner model from Shute and Towle (2018) is appropriate for formalizing adaptivity assessment in VLMs for mathematics tutoring.
Reference graph
Works this paper leans on
-
[1]
Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, and Huan Sun
Implementation of adaptive learning systems: Current state and potential. Online teaching and learning in higher education, pages 93–115. Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, and Huan Sun. 2024. A multi-aspect framework for counter narrative evaluation using large language models. In Proceedings of the 2024 Conference of the North American Chapter...
-
[2]
Vincent Liu, Ehsan Latif, and Xiaoming Zhai
Artificial intelligence in intelligent tutoring systems toward sustainable education: a systematic review. Smart learning environments, 10(1):41. Vincent Liu, Ehsan Latif, and Xiaoming Zhai. 2025. Advancing education through tutoring systems: A systematic literature review. arXiv preprint arXiv:2503.09748. Zhengyuan Liu, Stella Xin Yin, Geyu Lin, and Nancy ...
-
[3]
Exploring knowledge tracing in tutor-student dialogues using LLMs. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, pages 249–259. William H Schmidt and Richard T Houang. 2012. Curricular coherence and the common core state standards for mathematics. Educational Researcher, 41(8):294–308. Sahil Sharma, Puneet Mit...
-
[4]
Shuai Wang, Claire Christensen, Wei Cui, Richard Tong, Louise Yarnall, Linda Shear, and Mingyu Feng
Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169. Shuai Wang, Claire Christensen, Wei Cui, Richard Tong, Louise Yarnall, Linda Shear, and Mingyu Feng. 2023. When adaptive learning is effective learning: comparison of an adaptive learning system to teacher-led instru...
-
[5]
y=ab . Which is the value of y when a = 1 and b = 2
From recall to reasoning: Automated question generation for deeper math learning through large language models. In International Conference on Artificial Intelligence in Education, pages 414–422. Springer. Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, and 1 others. 2024. Mathvers...
-
[6]
GPT-5 and GPT-o1 are provided by OpenAI. The corresponding API documentation is available at https://platform.openai.com/docs/models.
-
[7]
Gemini-2.5-flash is provided by Google Gemini, with API documentation available at https://ai.google.dev/gemini-api/docs.
-
[8]
Qwen3-VL-30B-A3B-Instruct and Llama-3.2-11B-Vision-Instruct are open-source base model weights obtained from Hugging Face (https://huggingface.co/). For large proprietary models (e.g., GPT-5), a one-time evaluation on 600 samples incurs a cost of approximately $6 for generating explanations. All open-source model evaluations are conducted on a system eq...
-
[9]
Top scale:
  - Left side: 1 kg + 2 kg
  - Right side: pineapple
  - Because they balance, the pineapple must weigh 1 kg + 2 kg = 3 kg
-
[10]
Middle scale:
  - Left side: pineapple (3 kg) + 2 kg
  - Right side: the large anvil
  - These are equal, so the large anvil weighs 3 kg + 2 kg = 5 kg
-
[11]
Bottom scale:
  - Left side: large anvil (5 kg) + 1 kg
  - Right side: watermelon
  - They balance, which means the watermelon weighs 5 kg + 1 kg = 6 kg.
Answer: 6 kg.
--------------------------------------------------
A GOOD WAY TO THINK ABOUT THIS
Try giving each unknown object (pineapple, anvil, watermelon) a “mystery number.” Then use each balance to write ...
-
[12]
On the first scale, the 1 kg weight plus the 2 kg weight together balance with the pineapple. This tells us the pineapple weighs 1 kg + 2 kg = 3 kg.
-
[13]
On the second scale, the pineapple (3 kg) plus the 2 kg weight balance with the large black weight on the right. So the large weight must be 3 kg + 2 kg = 5 kg.
-
[14]
On the third scale, the large weight (5 kg) plus the 1 kg weight balance with the watermelon on the right. That means the watermelon weighs 5 kg + 1 kg = 6 kg. Therefore, the answer is 6 kg.
Figure 6: Case profiles for three learner profiles in G4Q5 (panels: High Performance, Intermediate Performance, Low Performance).
I Example Responses Across Context Conditions
Ta...
-
[16]
Middle scale: pineapple (3 kg) + 2 kg = single heavy weight → 5 kg
-
[17]
Bottom scale: 5 kg + 1 kg = watermelon→6 kg
Tip: Writing equations for each scale helps solve the puzzle step by step.
G3 Grade 4, low prior knowledge + full context
Let’s solve it step by step:
-
[18]
Top scale: 1 kg + 2 kg = pineapple→3 kg
-
[19]
Middle scale: pineapple (3 kg) + 2 kg = anvil→5 kg
-
[20]
Bottom scale: anvil (5 kg) + 1 kg = watermelon→6 kg
GOOD WAY: Assign a “mystery number” to each object and solve each scale sequentially.
G4 Grade 4, low prior knowledge + full context + learner motivation
Let’s work through each scale, step by step:
-
[21]
Top: 1 kg + 2 kg = pineapple→3 kg
-
[22]
Middle: pineapple (3 kg) + 2 kg = unlabeled weight→5 kg
-
[23]
Bottom: 5 kg + 1 kg = watermelon→6 kg
Therefore, the watermelon weighs 6 kilograms.
Table 7: O1 model responses across context conditions (G1–G4) for a Grade 4 learner on a balance scale problem. Increasing context leads to more step-by-step guidance and scaffolding.
J Prompt Design Strategies
This section provides comprehensive documentation of the eva...
-
[24]
I got 390 in the TIMSS 2019 Math Test. Can you teach me this math question?
Low-performance (Grade 8): I am a student from Grade 8, I don’t like learning mathematics and I am not confident in mathematics. Now I have mastered mathematics topics such as number and geometry in grade 8. I got 390 in the TIMSS 2019 Math Test. Can you teach me this math question?
Table ...
-
[25]
Demonstrate knowledge of place value (2-digit to 6-digit numbers); represent whole numbers with words, diagrams, number lines, or symbols; order numbers
-
[26]
Add and subtract (up to 4-digit numbers), including computation in simple contextual problems
-
[27]
Multiply (up to 3-digit by 1-digit and 2-digit by 2-digit numbers) and divide (up to 3-digit by 1-digit numbers), including computation in simple contextual problems
-
[28]
Solve problems involving odd and even numbers, multiples and factors of numbers, rounding numbers (up to the nearest ten thousand), and making estimates
-
[29]
Combine two or more properties of numbers or operations to solve problems in context.
Expressions, Simple Equations, and Relationships
-
[30]
Find the missing number or operation in a number sentence (e.g., 17 + w = 29)
-
[31]
Identify or write expressions or number sentences to represent problem situations that may involve unknowns
-
[32]
Identify and use relationships in a well-defined pattern (e.g., describe the relationship between adjacent terms and generate pairs of whole numbers given a rule).
Fractions and Decimals
-
[33]
Recognize fractions as parts of wholes or collections; represent fractions using words, numbers, or models; compare and order simple fractions; add and subtract simple fractions, including those set in problem situations. (Fractions may have denominators of 2, 3, 4, 5, 6, 8, 10, 12, or 100.)
-
[34]
Demonstrate knowledge of decimal place value including representing decimals using words, numbers, or models; compare, order, and round decimals; add and subtract decimals, including those set in problem situations. (Decimals may have one or two decimal places, allowing for computations with money.)
Measurement and Geometry: Measurement
1. Measure and estim...
-
[35]
Solve problems involving mass (gram and kilogram), volume (milliliter and liter), and time (minutes and hours); identify appropriate types and sizes of units and read scales
-
[36]
Solve problems involving perimeters of polygons, areas of rectangles, areas of shapes covered with squares or partial squares, and volumes filled with cubes.
Geometry
1. Identify and draw parallel and perpendicular lines; identify and draw right angles and angles smaller or larger than a right angle; compare angles by size
-
[37]
Use elementary properties, including line and rotational symmetry, to describe, compare, and create common two-dimensional shapes (circles, triangles, quadrilaterals, and other polygons)
-
[38]
Use elementary properties to describe and compare three-dimensional shapes (cubes, rectangular solids, cones, cylinders, and spheres) and relate these with their two-dimensional representations.
Data: Reading, Interpreting, and Representing Data
-
[39]
Read and interpret data from tables, pictographs, bar graphs, line graphs, and pie charts
-
[40]
Organize and represent data to help answer questions.
Data: Using Data to Solve Problems
Use data to answer questions that go beyond directly reading data displays (e.g., solve problems and perform computations using data, combine data from two or more sources, draw conclusions based on data).
Table 9: Content Areas & Learning Goals. TIMSS mathematical conte...
-
[41]
Compute and solve problems with positive and negative numbers, including through movement on the number line or various models (e.g., losses and gains, thermometers).
Fractions and Decimals
-
[42]
Using various models and representations, compare and order fractions and decimals, and identify equivalent fractions and decimals
-
[43]
Compute with fractions and decimals, including those set in problem situations.
Ratio, Proportion, and Percent
-
[44]
Identify and find equivalent ratios; model a given situation by using a ratio; divide a quantity according to a given ratio
-
[45]
Solve problems involving proportions or percents, including converting between percents and fractions or decimals.
Algebra: Expressions, Operations, and Equations
-
[46]
Find the value of an expression or a formula given values of the variables
-
[47]
Simplify algebraic expressions involving sums, products, and powers; compare expressions to determine if they are equivalent
-
[48]
Write expressions, equations, or inequalities to represent problem situations
-
[49]
Solve linear equations, linear inequalities, and simultaneous linear equations in two variables, including those that model real-life situations.
Relationships and Functions
-
[50]
Interpret, relate and generate representations of linear functions in tables, graphs, or words; identify properties of linear functions including slope and intercepts
-
[51]
Interpret, relate and generate representations of simple non-linear functions (e.g., quadratic) in tables, graphs, or words; generalize pattern relationships in a sequence using numbers, words, or algebraic expressions.
Geometry: Geometric Shapes and Measurements
-
[52]
Identify and draw types of angles and pairs of lines and use the relationships between angles on lines and in geometric figures to solve problems, including those involving the measures of angles and line segments; solve problems involving points in the Cartesian plane
-
[53]
Identify two-dimensional shapes and use their geometric properties to solve problems, including those involving perimeter, circumference, area, and the Pythagorean Theorem
-
[54]
Recognize and draw images of geometric transformations (translations, reflections, and rotations) in the plane; identify congruent and similar triangles and rectangles and solve related problems
-
[55]
Identify three-dimensional shapes and use their geometric properties to solve problems, including those involving surface area and volume; relate three-dimensional shapes with their two-dimensional representations.
Data and Probability: Data
1. Read and interpret data from one or more sources to solve problems (e.g., interpolate and extrapolate, make compa...
-
[56]
Identify appropriate procedures for collecting data; organize and represent data to help answer questions
-
[57]
Calculate, use, or interpret statistics (i.e., mean, median, mode, range) summarizing data distributions; recognize the effect of spread and outliers.
Probability
For simple and compound events: a) determine theoretical probability (based on equally likely outcomes, e.g., rolling a fair die) or b) estimate the empirical probability (based on experimental ...
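The balance-scale problem whose model responses are quoted in items [9]–[23] reduces to a chain of equations, pineapple = 1 + 2, anvil = pineapple + 2, watermelon = anvil + 1, solved in order by substituting each result into the next scale. This is the “mystery number” strategy the responses recommend. The sketch below is illustrative only; `solve_chain` and the list-of-items layout for each scale are hypothetical, and "anvil" names the middle object that some responses call the large or unlabeled weight.

```python
from typing import Union

Item = Union[int, str]  # a known weight in kg, or the name of an object solved earlier


def solve_chain(scales: list[tuple[list[Item], str]]) -> dict[str, int]:
    """Solve balances where the left items together weigh the same as one right-hand object."""
    weights: dict[str, int] = {}
    for left, right in scales:
        total = 0
        for item in left:
            if isinstance(item, str):
                if item not in weights:
                    raise ValueError(f"{item!r} is used before its weight is known")
                total += weights[item]  # substitute the previously solved "mystery number"
            else:
                total += item
        weights[right] = total
    return weights


BALANCES = [
    ([1, 2], "pineapple"),         # top scale
    (["pineapple", 2], "anvil"),   # middle scale
    (["anvil", 1], "watermelon"),  # bottom scale
]

print(solve_chain(BALANCES))  # {'pineapple': 3, 'anvil': 5, 'watermelon': 6}
```

Processing the scales top to bottom matters: each scale has exactly one unknown once the previous scale is solved, which is why the responses for every learner profile follow the same order and differ only in how much scaffolding surrounds each step.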