
arxiv: 2605.16011 · v1 · pith:CD2SJLH6new · submitted 2026-05-15 · 💻 cs.CL · cs.AI

Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study

Pith reviewed 2026-05-20 19:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords vision language models · mathematics education · adaptive instruction · learner models · rubric study · tutoring systems · model evaluation · personalized learning

The pith

Vision language models display measurable differences in adaptivity but struggle to consistently tailor mathematical instructions to different learner profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether vision language models can adjust their math teaching to fit individual students' needs. It introduces a rubric to score responses on alignment with cognitive level, motivation, and task complexity, while also checking for correct and high-quality content. Experiments across models reveal clear performance variations and frequent failures to maintain adaptive responses when learner details are sparse. Such capabilities matter because students increasingly use these models for personalized math support. The results identify concrete gaps that limit how effectively current models can serve as adaptive tutors.

Core claim

Using a rubric derived from learner modeling principles, the study demonstrates that vision language models produce instructional responses with varying degrees of adaptivity to learner profiles in mathematics education, but they do not reliably do so when learner information is limited.

What carries the argument

The learner model-based rubric, which breaks down adaptivity evaluation into cognitive aspects, motivational aspects, and complexity levels for assessing VLM-generated math instructions.
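
As a rough illustration of the rubric's shape, one rated response could be recorded as below. This is a minimal sketch, not the authors' implementation: the `RubricScore` class, the [0, 1] score range, and the unweighted mean in `adaptivity()` are all assumptions made for the example. The source names only the five dimensions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    """One rated VLM response (hypothetical container).

    The five field names follow the dimensions named in the paper.
    The [0, 1] range is an assumption for this sketch.
    """

    cognitive: float
    motivational: float
    complexity: float
    correctness: float
    quality: float

    def __post_init__(self) -> None:
        # Reject values outside the assumed normalized range.
        for name, value in vars(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def adaptivity(self) -> float:
        # Assumed aggregation: unweighted mean of the three adaptivity
        # aspects. Correctness and quality are kept separate.
        return mean([self.cognitive, self.motivational, self.complexity])


score = RubricScore(cognitive=0.5, motivational=0.25, complexity=0.75,
                    correctness=1.0, quality=1.0)
print(score.adaptivity())
```

Keeping correctness and quality out of the adaptivity aggregate mirrors the paper's description of them as additional dimensions rather than aspects of adaptivity.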

If this is right

  • VLMs will need enhancements to handle sparse learner data while still adapting effectively.
  • Model selection for educational applications should consider measured adaptivity differences.
  • Rubric-based evaluations can guide the development of more responsive tutoring AI.
  • Consistent production of adaptive responses could make VLMs more reliable math learning aids.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating more structured learner tracking into VLM interactions might improve their educational effectiveness.
  • Extending the rubric to other subjects could reveal if adaptivity issues are math-specific or general.
  • Hybrid approaches combining VLMs with traditional adaptive systems may address current limitations.
  • Testing with real student interactions could validate if rubric scores predict actual learning gains.

Load-bearing premise

That the three-aspect breakdown of adaptivity adequately captures the key ways instruction should vary for different mathematics learners.

What would settle it

A direct comparison where the same math problem is posed to a VLM with two different learner profiles provided, checking if the generated explanations, difficulty, and encouragement levels adjust as expected for each profile.
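
The paired comparison described above could be set up along these lines. This is a sketch under stated assumptions: `LearnerProfile`, `build_prompt`, the prompt wording, and the sample question are hypothetical, and `surface_similarity` is only a crude text-overlap proxy, not the paper's rubric. No model is called; the code only prepares the two prompts and a way to compare whatever responses come back.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class LearnerProfile:
    """Hypothetical learner profile with a few illustrative attributes."""

    grade: int
    likes_math: bool
    confident: bool


def build_prompt(profile: LearnerProfile, question: str) -> str:
    # Illustrative wording only; the real prompt templates may differ.
    attitude = "like" if profile.likes_math else "don't like"
    confidence = "am confident" if profile.confident else "am not confident"
    return (
        f"I am a student from Grade {profile.grade}. "
        f"I {attitude} learning mathematics and I {confidence} in mathematics. "
        f"Can you teach me this math question? {question}"
    )


def surface_similarity(response_a: str, response_b: str) -> float:
    # Character-level overlap in [0, 1]. A value near 1 suggests the model
    # barely changed its answer between profiles; it says nothing about
    # whether the changes are pedagogically appropriate.
    return SequenceMatcher(None, response_a, response_b).ratio()


question = "What is 3/4 + 1/8?"  # made-up item for the example
low = LearnerProfile(grade=4, likes_math=False, confident=False)
high = LearnerProfile(grade=4, likes_math=True, confident=True)

prompt_low = build_prompt(low, question)
prompt_high = build_prompt(high, question)
```

Each prompt would then be sent to the same VLM, and the two responses scored by human raters on the rubric. A high `surface_similarity` between them would be an early warning that the profile had little effect.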

Figures

Figures reproduced from arXiv: 2605.16011 by Adam K. Dube, Jackie Chi Kit Cheung, Jie Gao, Junzhu Su, Yiran Lin, Yongan Yu.

Figure 1: Overview of the learner model-based adap…
Figure 2: Overview of the adaptive rubric–based evaluation pipeline. Learner profiles and mathematics questions…
Figure 3: A case study of the adaptive evaluation process for a Grade 4 math problem. The upper panel shows the…
Figure 4: Descriptive Plots of Model Performance by Item. The x-axis represents the math items, and the y-axis…
Figure 5: Descriptive Plots by Test Group. Performance stratification across 9 items. The x-axis represents the items, and the y-axis represents the normalized score. One group (open circles) consistently underperforms compared to the others.
Figure 6: Case profiles for three learner profiles in G4Q5.
Original abstract

Adaptive learning refers to educational technologies that track learners' learning progress and adapt the instructional process based on individual learners' learning performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs have the ability to adapt to different learner profiles when providing mathematical instructions. Current VLMs lack a systematic evaluation framework for this adaptivity to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a learner model-based rubric, drawing on Shute and Towle (2018), to evaluate whether vision-language models can adapt mathematical instructional responses to different learner profiles. The rubric operationalizes adaptivity along cognitive, motivational, and complexity dimensions (plus correctness and quality), and the experiments report measurable differences across VLMs together with consistent struggles to produce appropriately adapted responses when learner information is limited.

Significance. If the rubric provides a valid operationalization, the work supplies an initial empirical benchmark for VLM adaptivity in mathematics education and identifies concrete limitations that could guide future model development or fine-tuning for personalized tutoring. The study also demonstrates a practical way to apply established learner-model concepts to static VLM outputs.

major comments (2)
  1. [Methods / Rubric definition] The learner model of Shute and Towle (2018) is a dynamic, multi-turn framework that tracks and updates knowledge, motivation, and task complexity across interactions. The manuscript applies the same three dimensions to single-turn VLM responses conditioned on varying amounts of learner information. No explicit bridging argument or validation is provided showing that one-shot rubric scores faithfully reflect the intended adaptive-tracking mechanism; low scores could therefore arise from prompt sensitivity or rubric mismatch rather than an intrinsic VLM limitation.
  2. [Experiments / Results] The central claim that VLMs 'struggle to consistently produce learner model-based instructional responses' rests on the rubric scores. Without reported inter-rater reliability, statistical tests for the observed differences, or ablation on prompt phrasing, it is difficult to determine whether the measurable differences are robust or driven by the particular prompt templates and limited-information conditions used in the evaluation.
minor comments (2)
  1. [Introduction] The abstract and introduction cite Shute and Towle (2018) but do not discuss how the original dynamic model was adapted for static evaluation; a short paragraph clarifying this mapping would improve transparency.
  2. [Results] Figure or table captions should explicitly state the number of models, prompts, and raters so readers can assess the scale of the reported differences without consulting the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and rigor of our evaluation framework. We address each major point below and describe the changes we will make in the revised manuscript.

  1. Referee: [Methods / Rubric definition] The learner model of Shute and Towle (2018) is a dynamic, multi-turn framework that tracks and updates knowledge, motivation, and task complexity across interactions. The manuscript applies the same three dimensions to single-turn VLM responses conditioned on varying amounts of learner information. No explicit bridging argument or validation is provided showing that one-shot rubric scores faithfully reflect the intended adaptive-tracking mechanism; low scores could therefore arise from prompt sensitivity or rubric mismatch rather than an intrinsic VLM limitation.

    Authors: We agree that the Shute and Towle model is designed for dynamic, multi-turn tracking. Our work intentionally evaluates single-turn VLM outputs as an initial benchmark for how well current models incorporate provided learner-profile information in typical one-shot tutoring queries. In the revision we will add a new subsection under Methods that explicitly bridges the dynamic framework to our static operationalization, including the rationale for scoring adaptation based on the information supplied in a single prompt. We will also add a short expert-validation paragraph (two domain experts re-scoring a subset of responses) and a limitations paragraph acknowledging that full multi-turn validation remains future work. revision: partial

  2. Referee: [Experiments / Results] The central claim that VLMs 'struggle to consistently produce learner model-based instructional responses' rests on the rubric scores. Without reported inter-rater reliability, statistical tests for the observed differences, or ablation on prompt phrasing, it is difficult to determine whether the measurable differences are robust or driven by the particular prompt templates and limited-information conditions used in the evaluation.

    Authors: We accept that the current presentation lacks these quantitative safeguards. In the revised version we will (1) report inter-rater reliability via Cohen’s kappa on the rubric annotations, (2) add appropriate statistical tests (ANOVA with post-hoc corrections) for model and information-level differences, and (3) include a prompt-ablation experiment that varies phrasing while keeping learner information constant. These additions will be placed in a new “Robustness Analyses” subsection of the Experiments section and will directly support the robustness of the reported differences. revision: yes
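
For readers unfamiliar with the inter-rater statistic named in the response above, Cohen's kappa for two raters can be computed as follows. This is a generic textbook sketch, not the authors' analysis code: the function name and the toy labels are illustrative, and the handling of the degenerate case (both raters using one identical label throughout) is a choice made for this example.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example: two raters scoring four responses as adapted (1) or not (0).
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
```

In the toy example the raters agree on 3 of 4 items (observed 0.75) against a chance agreement of 0.5, giving kappa 0.5. If the rubric scores are ordinal rather than binary, a weighted variant of kappa may be the more suitable choice.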

Circularity Check

0 steps flagged

No circularity: empirical rubric evaluation is self-contained


The paper draws on the external Shute and Towle (2018) learner model to define a three-aspect rubric (cognitive, motivational, complexity) plus correctness and quality dimensions, then reports experimental results on VLM responses to static prompts with varying learner information. No equations, fitted parameters, or self-referential definitions appear; the strongest claims rest on observed differences in rubric scores rather than any reduction of outputs to the paper's own inputs by construction. The cited framework is treated as an independent basis for operationalization, with results presented as direct measurements from the evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the appropriateness of the Shute and Towle learner model for VLM evaluation and the assumption that the three-aspect rubric plus correctness and quality dimensions capture meaningful adaptivity.

axioms (1)
  • domain assumption: The learner model from Shute and Towle (2018) is appropriate for formalizing adaptivity assessment in VLMs for mathematics tutoring.
    The paper explicitly draws on this model to define the rubric's three aspects: cognitive, motivational, and complexity.

pith-pipeline@v0.9.0 · 5747 in / 1219 out tokens · 37911 ms · 2026-05-20T19:30:59.752872+00:00 · methodology

