arXiv: 2604.13666 · v1 · submitted 2026-04-15 · 💻 cs.CY · cs.AI · cs.LG

Automatically Inferring Teachers' Geometric Content Knowledge: A Skills Based Approach

Pith reviewed 2026-05-10 12:38 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.LG
keywords Van Hiele model · geometric content knowledge · automated assessment · large language models · skills dictionary · teacher education · retrieval-augmented generation · multi-task learning

The pith

Integrating a skills dictionary into LLM classifiers significantly improves automatic assessment of teachers' Van Hiele geometric reasoning levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an automated method for diagnosing teachers' geometric reasoning levels under the Van Hiele model. It builds a dictionary of 33 fine-grained skills and compares skills-aware retrieval-augmented generation (RAG) and multi-task learning (MTL) classifiers against baselines that receive no skills information. On 226 annotated responses from pre-service teachers, the skills-aware variants outperform the baselines. This would enable scalable assessment in place of manual expert analysis.

Core claim

By grounding large language models in a structured skills dictionary that decomposes Van Hiele levels, the skills-aware variants of RAG and MTL classifiers achieve significantly better performance in predicting the Van Hiele level and demonstrated skills from open-ended teacher responses.

What carries the argument

The skills dictionary of 33 fine-grained reasoning skills derived from Van Hiele levels, incorporated into retrieval-augmented generation and multi-task learning classifiers.

If this is right

  • Provides the first automated approach for classifying Van Hiele levels from open-ended responses.
  • Supports large-scale evaluation of teachers' geometric content knowledge.
  • Enables adaptive and personalized teacher learning systems based on diagnosed reasoning levels.
  • Reduces time and cost compared to traditional manual expert analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar skills-dictionary approaches could extend to assessing student geometric reasoning or other subject areas.
  • Improving the quality of expert annotations might further boost classifier accuracy.
  • Integration with real-time classroom tools could allow ongoing monitoring of teacher development.

Load-bearing premise

Expert annotations of the 226 responses with Van Hiele levels and demonstrated skills provide reliable ground truth for training and evaluating the classifiers.

What would settle it

A larger dataset where skills-aware classifiers do not outperform baselines, or where inter-annotator agreement on Van Hiele levels is low.

Figures

Figures reproduced from arXiv: 2604.13666 by Avi Segal, Hassan Ayoob, Inbal Israel, Kobi Gal, Osama Swidan, Ziv Fenigstein.

Figure 1. Overview of the proposed skills-based Van Hiele classification framework.
Figure 2. Retrieval-augmented generation architecture for Method I. Adjacent text: "… the system prompt includes both Van Hiele level definitions and the full skills dictionary. All other components of the pipeline, including the input encoding, retrieval procedure, number of retrieved examples, language model, and inference settings, are the same for both variants. As a result, any observed performance differences between the bas…"
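
The controlled comparison this caption describes (two variants that differ only in the system prompt) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's prompt: the level names follow the standard Van Hiele sequence, but the 1-5 numbering, the three skill entries, the prompt wording, and the function name `build_system_prompt` are hypothetical placeholders.

```python
# Hypothetical sketch: baseline vs. skills-aware system prompts.
# Skill entries and prompt wording are illustrative placeholders,
# not the paper's actual 33-skill dictionary or prompt text.

VAN_HIELE_LEVELS = {
    1: "Visualization: recognizes shapes by overall appearance.",
    2: "Analysis: describes shapes through their properties.",
    3: "Abstraction: relates properties and follows informal arguments.",
    4: "Deduction: constructs formal proofs from axioms and definitions.",
    5: "Rigor: compares and reasons within different axiomatic systems.",
}

# skill id -> (associated level, description); placeholders only.
SKILLS_DICTIONARY = {
    "S1": (1, "Names a shape from its visual appearance."),
    "S2": (2, "Lists properties of a given shape."),
    "S3": (3, "Explains why one property implies another."),
}


def build_system_prompt(include_skills: bool) -> str:
    """Return the system prompt; only the skills block differs between variants."""
    lines = ["Classify the response into one Van Hiele level (1-5).", "", "Levels:"]
    for level, definition in sorted(VAN_HIELE_LEVELS.items()):
        lines.append(f"  {level}. {definition}")
    if include_skills:
        lines += ["", "Skills dictionary:"]
        for skill_id, (level, text) in sorted(SKILLS_DICTIONARY.items()):
            lines.append(f"  {skill_id} (level {level}): {text}")
    return "\n".join(lines)


baseline_prompt = build_system_prompt(include_skills=False)
skills_aware_prompt = build_system_prompt(include_skills=True)
```

Because the skills-aware prompt is the baseline prompt plus one appended block, any difference in downstream behavior can be traced to that block, which is the property the caption relies on.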
Figure 3. Adjacent text: "… ) fine-tunes an open-source LLM for Van Hiele classification. Similarly to Method I, the model receives a question-response pair as input and outputs the Van Hiele level. We implemented two variants: a baseline that performs classification without skills information during training, and a skills-aware variant that incorporates an auxiliary skills prediction task with Van Hiele classification. Both variant…"
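
The text above does not state how the auxiliary skills task is combined with level classification. One conventional formulation, shown here only as a sketch, adds a weighted multi-label binary cross-entropy over skills to the cross-entropy over levels. The weight `aux_weight`, the function names, and the choice of losses are assumptions, not details confirmed by the paper.

```python
import math


def level_cross_entropy(level_logits, true_level_index):
    """Softmax cross-entropy for the single-label Van Hiele level task."""
    m = max(level_logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in level_logits))
    return log_norm - level_logits[true_level_index]


def skills_binary_cross_entropy(skill_logits, skill_targets):
    """Mean binary cross-entropy over skills (multi-label), in a stable form."""
    total = 0.0
    for logit, target in zip(skill_logits, skill_targets):
        # Equivalent to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))].
        total += max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
    return total / len(skill_logits)


def multitask_loss(level_logits, true_level_index, skill_logits, skill_targets,
                   aux_weight=0.5):
    """Level loss plus weighted auxiliary skills loss (aux_weight is assumed)."""
    main = level_cross_entropy(level_logits, true_level_index)
    aux = skills_binary_cross_entropy(skill_logits, skill_targets)
    return main + aux_weight * aux


# Toy values: 5 levels, 3 placeholder skills, all logits zero.
skills_aware_loss = multitask_loss([0.0] * 5, 2, [0.0] * 3, [1, 0, 1])
baseline_loss = multitask_loss([0.0] * 5, 2, [0.0] * 3, [1, 0, 1], aux_weight=0.0)
```

Setting `aux_weight` to zero recovers a level-only objective, which mirrors the baseline variant in this sketch.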
Figure 4. Average results across 5-fold cross-validation comparing baseline and skills-aware variants for RAG (Method I) and MTL (Method II). Numbers above bars show mean scores; lines within the bars indicate standard deviation. For the MAE measure, lower is better. Adjacent text: "… validate the performance gain of the skills-aware variant to selecting the right skills for the right question-response pair, as opposed to simply providin…"
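
For readers unfamiliar with the metrics in this caption, the sketch below computes accuracy and mean absolute error (MAE) over integer-coded levels and summarizes per-fold scores as mean and standard deviation. The toy labels and fold scores are invented for illustration, and the use of the sample standard deviation is an assumption; the caption does not say which estimator the paper uses.

```python
from statistics import mean, stdev


def accuracy(true_levels, predicted_levels):
    """Fraction of responses whose predicted level matches exactly."""
    hits = sum(t == p for t, p in zip(true_levels, predicted_levels))
    return hits / len(true_levels)


def mean_absolute_error(true_levels, predicted_levels):
    """Average distance in levels; lower is better."""
    errors = [abs(t - p) for t, p in zip(true_levels, predicted_levels)]
    return sum(errors) / len(errors)


def summarize_folds(fold_scores):
    """Mean and sample standard deviation across cross-validation folds."""
    return mean(fold_scores), stdev(fold_scores)


# Toy data, not results from the paper.
true_levels = [1, 2, 3, 4]
predicted_levels = [1, 2, 4, 2]
toy_accuracy = accuracy(true_levels, predicted_levels)        # 2 of 4 exact matches
toy_mae = mean_absolute_error(true_levels, predicted_levels)  # (0 + 0 + 1 + 2) / 4
fold_mean, fold_std = summarize_folds([0.5, 0.6, 0.7, 0.6, 0.6])
```

MAE treats the levels as ordered, so predicting level 2 for a level 4 response costs twice as much as an off-by-one error, whereas accuracy counts both simply as misses.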
Original abstract

Assessing teachers' geometric content knowledge is essential for geometry instructional quality and student learning, but difficult to scale. The Van Hiele model characterizes geometric reasoning through five hierarchical levels. Traditional Van Hiele assessment relies on manual expert analysis of open-ended responses. This process is time-consuming, costly, and prevents large-scale evaluation. This study develops an automated approach for diagnosing teachers' Van Hiele reasoning levels using large language models grounded in educational theory. Our central hypothesis is that integrating explicit skills information significantly improves Van Hiele classification. In collaboration with mathematics education researchers, we built a structured skills dictionary decomposing the Van Hiele levels into 33 fine-grained reasoning skills. Through a custom web platform, 31 pre-service teachers solved geometry problems, yielding 226 responses. Expert researchers then annotated each response with its Van Hiele level and demonstrated skills from the dictionary. Using this annotated dataset, we implemented two classification approaches: (1) retrieval-augmented generation (RAG) and (2) multi-task learning (MTL). Each approach compared a skills-aware variant incorporating the skills dictionary against a baseline without skills information. Results showed that for both methods, skills-aware variants significantly outperformed baselines across multiple evaluation metrics. This work provides the first automated approach for Van Hiele level classification from open-ended responses. It offers a scalable, theory-grounded method for assessing teachers' geometric reasoning that can enable large-scale evaluation and support adaptive, personalized teacher learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops an automated LLM-based approach to classify pre-service teachers' Van Hiele geometric reasoning levels from open-ended responses. It constructs a 33-item skills dictionary decomposing the five Van Hiele levels, collects 226 annotated responses from 31 teachers via a web platform, and evaluates two methods (RAG and MTL) by comparing skills-aware variants that incorporate the dictionary against baselines without skills information, claiming that the skills-aware versions significantly outperform the baselines on multiple metrics.

Significance. If the central results hold after addressing label reliability, the work would be significant as the first automated, theory-grounded system for scaling Van Hiele assessment of teachers' geometric content knowledge. The explicit skills dictionary provides a falsifiable bridge between educational theory and LLM classification, with potential to support large-scale teacher evaluation and adaptive professional development. The dual-method design (RAG and MTL) and use of expert-annotated real responses strengthen the contribution relative to purely synthetic or ungrounded LLM prompting approaches.

major comments (2)
  1. Annotation process (described in the abstract and data collection): The 226 responses are annotated by expert researchers with Van Hiele levels and demonstrated skills from the 33-item dictionary to serve as ground truth for training and evaluating both the RAG and MTL classifiers. No inter-rater reliability statistics (e.g., Cohen's kappa, Fleiss' kappa, or raw agreement percentages) are reported, and there is no description of how multiple annotators resolved disagreements or validated consistent application of the skills dictionary. Because the headline claim that 'skills-aware variants significantly outperformed baselines' rests directly on these labels, any systematic noise or bias in the annotations would inflate the reported performance gap and prevent interpretation of the result as evidence for the value of the skills dictionary.
  2. Results and evaluation (abstract and corresponding results section): The abstract states that skills-aware variants 'significantly outperformed baselines across multiple evaluation metrics' but supplies no numerical values, statistical tests (e.g., p-values, confidence intervals), baseline definitions, dataset split details, or error analysis. Without these, it is impossible to assess the magnitude of improvement, whether the gains are robust to different splits, or whether they survive correction for multiple comparisons.
minor comments (2)
  1. The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy or F1 improvement) to allow readers to gauge the practical significance of the outperformance claim.
  2. Clarify the exact prompting strategy and retrieval mechanism used in the RAG skills-aware variant, including how the skills dictionary is injected at inference time.
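
Major comment 1 asks for inter-rater reliability statistics. A minimal sketch of Cohen's kappa for two annotators is given below; the function name and the toy labels are illustrative and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy Van Hiele level labels from two hypothetical annotators.
annotator_a = [1, 1, 2, 2, 3, 3]
annotator_b = [1, 1, 2, 3, 3, 3]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this toy case the annotators agree on 5 of 6 items and chance agreement is 1/3, giving kappa = 0.75. Since Van Hiele levels are ordered, a weighted kappa that penalizes distant disagreements more heavily may be a better fit for the level labels, while plain kappa per skill suits the binary skill annotations.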

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important aspects of transparency in our annotation process and results reporting. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: Annotation process (described in the abstract and data collection): The 226 responses are annotated by expert researchers with Van Hiele levels and demonstrated skills from the 33-item dictionary to serve as ground truth for training and evaluating both the RAG and MTL classifiers. No inter-rater reliability statistics (e.g., Cohen's kappa, Fleiss' kappa, or raw agreement percentages) are reported, and there is no description of how multiple annotators resolved disagreements or validated consistent application of the skills dictionary. Because the headline claim that 'skills-aware variants significantly outperformed baselines' rests directly on these labels, any systematic noise or bias in the annotations would inflate the reported performance gap and prevent interpretation of the result as evidence for the value of the skills dictionary.

    Authors: We agree that explicit reporting of inter-rater reliability is necessary to substantiate the ground-truth labels. The annotations were performed collaboratively by multiple expert researchers in mathematics education, with disagreements resolved via discussion until consensus was reached. In the revised manuscript, we will expand the data collection section to include: (i) the number of annotators, (ii) the full annotation protocol and training on the skills dictionary, (iii) raw agreement percentages, and (iv) Cohen's kappa (or Fleiss' kappa) computed separately for Van Hiele level assignments and for the 33-skill annotations. These additions will directly address concerns about label reliability and allow readers to assess the robustness of the performance gains. revision: yes

  2. Referee: Results and evaluation (abstract and corresponding results section): The abstract states that skills-aware variants 'significantly outperformed baselines across multiple evaluation metrics' but supplies no numerical values, statistical tests (e.g., p-values, confidence intervals), baseline definitions, dataset split details, or error analysis. Without these, it is impossible to assess the magnitude of improvement, whether the gains are robust to different splits, or whether they survive correction for multiple comparisons.

    Authors: We acknowledge that the submitted abstract and results section did not provide sufficient quantitative detail for independent evaluation. The manuscript does contain the underlying performance tables and comparisons, but we agree these must be foregrounded. In the revision we will: (i) update the abstract to report the key metric improvements (e.g., accuracy/F1 deltas) and associated p-values, (ii) add explicit definitions of the baselines, (iii) detail the train/test splits and any cross-validation procedure, (iv) include confidence intervals and error analysis, and (v) state whether multiple-comparison corrections were applied. These changes will enable readers to judge both the magnitude and statistical robustness of the reported gains. revision: yes
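
The statistical reporting promised in response 2 could take many forms. As one hedged illustration, the sketch below pairs an exact sign-flip permutation test on paired score differences with a Holm correction across metrics. The function names, the toy differences, and the toy p-values are assumptions; nothing here reflects the tests the authors actually ran.

```python
from itertools import product


def sign_flip_p_value(differences):
    """Exact two-sided paired sign-flip test on the summed differences."""
    observed = abs(sum(differences))
    extreme = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in product((1, -1), repeat=len(differences)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, differences)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Toy per-fold gains (skills-aware minus baseline) for one metric.
fold_gains = [0.05, 0.04, 0.06, 0.03, 0.05]
p_fold = sign_flip_p_value(fold_gains)

# Toy raw p-values for three metrics, corrected jointly.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

With only five paired fold scores there are 32 sign assignments, so the smallest attainable two-sided p-value is 2/32 = 0.0625 even when every fold favors the skills-aware variant. If significance were assessed at the fold level alone, it could not reach 0.05; a test over per-response paired outcomes would have more resolution.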

Circularity Check

0 steps flagged

No significant circularity: standard supervised ML on external expert annotations

Full rationale

The paper's derivation consists of (1) constructing a 33-item skills dictionary in collaboration with education researchers, (2) collecting 226 open-ended responses from 31 pre-service teachers, (3) having expert researchers annotate each response with a Van Hiele level plus demonstrated skills from the dictionary, and (4) training/evaluating two classifiers (RAG and MTL) that compare a skills-aware variant against a baseline without skills information. All performance claims are measured against the held-out expert annotations as ground truth. This is a conventional empirical pipeline; the target labels are produced by independent human experts rather than being defined by the model's own outputs or by any fitted parameter that is later renamed as a prediction. No equations, self-citations, or uniqueness theorems appear in the provided text that would reduce the reported outperformance to a definitional identity. The central result therefore remains falsifiable by new annotations or new data and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based on abstract only; full paper may contain additional parameters or assumptions not visible here.

axioms (2)
  • domain assumption: The Van Hiele model accurately characterizes geometric reasoning through five hierarchical levels.
    Foundational to the entire classification task and skills dictionary construction.
  • domain assumption: Expert researchers can reliably annotate responses with both Van Hiele levels and the 33 specific skills.
    These annotations serve as ground truth for both training and evaluation of the models.

pith-pipeline@v0.9.0 · 5579 in / 1277 out tokens · 35654 ms · 2026-05-10T12:38:32.760763+00:00 · methodology

