Automatically Inferring Teachers' Geometric Content Knowledge: A Skills Based Approach
Pith reviewed 2026-05-10 12:38 UTC · model grok-4.3
The pith
Integrating a skills dictionary into LLM classifiers significantly improves automatic assessment of teachers' Van Hiele geometric reasoning levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By grounding large language models in a structured skills dictionary that decomposes Van Hiele levels, the skills-aware variants of RAG and MTL classifiers achieve significantly better performance in predicting the Van Hiele level and demonstrated skills from open-ended teacher responses.
What carries the argument
The skills dictionary of 33 fine-grained reasoning skills derived from Van Hiele levels, incorporated into retrieval-augmented generation and multi-task learning classifiers.
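As a rough illustration of what such a dictionary might look like as a data structure, the sketch below maps skills to Van Hiele levels. The level names follow the common naming of the five levels; the 1-5 numbering, the skill identifiers, and the skill descriptions are hypothetical placeholders and are not entries from the paper's 33-item dictionary.

```python
from dataclasses import dataclass

# Common names for the five Van Hiele levels. The 1-5 numbering is an
# assumption; some sources number the levels 0-4 instead.
VAN_HIELE_LEVELS = {
    1: "Visualization",
    2: "Analysis",
    3: "Informal deduction",
    4: "Deduction",
    5: "Rigor",
}


@dataclass(frozen=True)
class Skill:
    skill_id: str      # hypothetical identifier
    level: int         # Van Hiele level the skill belongs to
    description: str   # hypothetical wording, not from the paper


# Placeholder entries only; the real dictionary has 33 skills.
SKILLS = [
    Skill("S1", 1, "Recognizes a shape by its overall appearance"),
    Skill("S2", 2, "Lists properties of a shape"),
    Skill("S3", 3, "Relates properties across classes of shapes"),
]


def skills_for_level(level: int) -> list[Skill]:
    """Return all dictionary skills attached to one Van Hiele level."""
    if level not in VAN_HIELE_LEVELS:
        raise ValueError(f"unknown Van Hiele level: {level}")
    return [s for s in SKILLS if s.level == level]
```

Keeping the level on each skill, rather than nesting skills under levels, makes it straightforward to look skills up in either direction.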
If this is right
- Provides the first automated approach for classifying Van Hiele levels from open-ended responses.
- Supports large-scale evaluation of teachers' geometric content knowledge.
- Enables adaptive and personalized teacher learning systems based on diagnosed reasoning levels.
- Reduces time and cost compared to traditional manual expert analysis.
Where Pith is reading between the lines
- Similar skills-dictionary approaches could extend to assessing student geometric reasoning or other subject areas.
- Improving the quality of expert annotations might further boost classifier accuracy.
- Integration with real-time classroom tools could allow ongoing monitoring of teacher development.
Load-bearing premise
Expert annotations of the 226 responses with Van Hiele levels and demonstrated skills provide reliable ground truth for training and evaluating the classifiers.
What would settle it
A larger dataset where skills-aware classifiers do not outperform baselines, or where inter-annotator agreement on Van Hiele levels is low.
Original abstract
Assessing teachers' geometric content knowledge is essential for geometry instructional quality and student learning, but difficult to scale. The Van Hiele model characterizes geometric reasoning through five hierarchical levels. Traditional Van Hiele assessment relies on manual expert analysis of open-ended responses. This process is time-consuming, costly, and prevents large-scale evaluation. This study develops an automated approach for diagnosing teachers' Van Hiele reasoning levels using large language models grounded in educational theory. Our central hypothesis is that integrating explicit skills information significantly improves Van Hiele classification. In collaboration with mathematics education researchers, we built a structured skills dictionary decomposing the Van Hiele levels into 33 fine-grained reasoning skills. Through a custom web platform, 31 pre-service teachers solved geometry problems, yielding 226 responses. Expert researchers then annotated each response with its Van Hiele level and demonstrated skills from the dictionary. Using this annotated dataset, we implemented two classification approaches: (1) retrieval-augmented generation (RAG) and (2) multi-task learning (MTL). Each approach compared a skills-aware variant incorporating the skills dictionary against a baseline without skills information. Results showed that for both methods, skills-aware variants significantly outperformed baselines across multiple evaluation metrics. This work provides the first automated approach for Van Hiele level classification from open-ended responses. It offers a scalable, theory-grounded method for assessing teachers' geometric reasoning that can enable large-scale evaluation and support adaptive, personalized teacher learning systems.
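The abstract does not state how the multi-task objective is defined. One common formulation, shown here purely as an assumption, adds a cross-entropy loss over the five levels to a binary cross-entropy loss over the skill labels, weighted by a hypothetical coefficient `skill_weight`. Setting `skill_weight=0.0` recovers a level-only objective, which is one plausible reading of a baseline without skills information.

```python
import math

EPS = 1e-12  # guards against log(0)


def _clamp(p):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def level_loss(level_probs, true_level):
    """Cross-entropy for the single-label level task (0-based index)."""
    return -math.log(_clamp(level_probs[true_level]))


def skills_loss(skill_probs, true_skills):
    """Mean binary cross-entropy for the multi-label skills task."""
    total = 0.0
    for p, y in zip(skill_probs, true_skills):
        p = _clamp(p)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(skill_probs)


def joint_loss(level_probs, true_level, skill_probs, true_skills,
               skill_weight=1.0):
    """Assumed MTL objective: level loss plus weighted skills loss."""
    return (level_loss(level_probs, true_level)
            + skill_weight * skills_loss(skill_probs, true_skills))
```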
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops an automated LLM-based approach to classify pre-service teachers' Van Hiele geometric reasoning levels from open-ended responses. It constructs a 33-item skills dictionary decomposing the five Van Hiele levels, collects 226 annotated responses from 31 teachers via a web platform, and evaluates two methods (RAG and MTL) by comparing skills-aware variants that incorporate the dictionary against baselines without skills information, claiming that the skills-aware versions significantly outperform the baselines on multiple metrics.
Significance. If the central results hold after addressing label reliability, the work would be significant as the first automated, theory-grounded system for scaling Van Hiele assessment of teachers' geometric content knowledge. The explicit skills dictionary provides a falsifiable bridge between educational theory and LLM classification, with potential to support large-scale teacher evaluation and adaptive professional development. The dual-method design (RAG and MTL) and use of expert-annotated real responses strengthen the contribution relative to purely synthetic or ungrounded LLM prompting approaches.
major comments (2)
- Annotation process (described in the abstract and data collection): The 226 responses are annotated by expert researchers with Van Hiele levels and demonstrated skills from the 33-item dictionary to serve as ground truth for training and evaluating both the RAG and MTL classifiers. No inter-rater reliability statistics (e.g., Cohen's kappa, Fleiss' kappa, or raw agreement percentages) are reported, and there is no description of how multiple annotators resolved disagreements or validated consistent application of the skills dictionary. Because the headline claim that 'skills-aware variants significantly outperformed baselines' rests directly on these labels, any systematic noise or bias in the annotations would inflate the reported performance gap and prevent interpretation of the result as evidence for the value of the skills dictionary.
- Results and evaluation (abstract and corresponding results section): The abstract states that skills-aware variants 'significantly outperformed baselines across multiple evaluation metrics' but supplies no numerical values, statistical tests (e.g., p-values, confidence intervals), baseline definitions, dataset split details, or error analysis. Without these, it is impossible to assess the magnitude of improvement, whether the gains are robust to different splits, or whether they survive correction for multiple comparisons.
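For reference, the inter-rater statistic requested in the first major comment can be computed in a few lines. The sketch below implements Cohen's kappa for two raters assigning one categorical label (for example, a Van Hiele level) per response. The example labels are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and returning 1.0 is a convention chosen here.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels: 4 of 6 items agree, so raw agreement is about 0.67,
# while kappa is noticeably lower once chance agreement is removed.
levels_a = [1, 2, 2, 3, 3, 3]
levels_b = [1, 2, 3, 3, 3, 2]
kappa = cohens_kappa(levels_a, levels_b)
```

For the 33-skill annotations, which are multi-label, one option would be to compute kappa per skill on the binary present/absent decisions.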
minor comments (2)
- The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy or F1 improvement) to allow readers to gauge the practical significance of the outperformance claim.
- Clarify the exact prompting strategy and retrieval mechanism used in the RAG skills-aware variant, including how the skills dictionary is injected at inference time.
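To make the last minor comment concrete, here is a minimal sketch of what "injecting the skills dictionary at inference time" could mean. It is not the paper's pipeline: bag-of-words cosine similarity stands in for whatever embedding retrieval the authors use, and the prompt template, function names, and example fields are all hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, examples, k=2):
    """Return the k annotated examples most similar to the query."""
    q = bow(query)
    ranked = sorted(examples,
                    key=lambda ex: cosine(q, bow(ex["response"])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, examples, skills_text=None, k=2):
    """Few-shot prompt; passing skills_text gives the skills-aware variant."""
    parts = []
    if skills_text is not None:
        parts.append("Skills dictionary:\n" + skills_text)
    for ex in retrieve(query, examples, k):
        block = f"Response: {ex['response']}\nLevel: {ex['level']}"
        if skills_text is not None:
            block += f"\nSkills: {', '.join(ex['skills'])}"
        parts.append(block)
    parts.append(f"Response: {query}\nLevel:")
    return "\n\n".join(parts)
```

In this sketch the baseline and skills-aware prompts differ only in whether the dictionary text and the per-example skill labels are included, which is the kind of detail the comment asks the authors to spell out.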
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight important aspects of transparency in our annotation process and results reporting. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: Annotation process (described in the abstract and data collection): The 226 responses are annotated by expert researchers with Van Hiele levels and demonstrated skills from the 33-item dictionary to serve as ground truth for training and evaluating both the RAG and MTL classifiers. No inter-rater reliability statistics (e.g., Cohen's kappa, Fleiss' kappa, or raw agreement percentages) are reported, and there is no description of how multiple annotators resolved disagreements or validated consistent application of the skills dictionary. Because the headline claim that 'skills-aware variants significantly outperformed baselines' rests directly on these labels, any systematic noise or bias in the annotations would inflate the reported performance gap and prevent interpretation of the result as evidence for the value of the skills dictionary.
Authors: We agree that explicit reporting of inter-rater reliability is necessary to substantiate the ground-truth labels. The annotations were performed collaboratively by multiple expert researchers in mathematics education, with disagreements resolved via discussion until consensus was reached. In the revised manuscript, we will expand the data collection section to include: (i) the number of annotators, (ii) the full annotation protocol and training on the skills dictionary, (iii) raw agreement percentages, and (iv) Cohen's kappa (or Fleiss' kappa) computed separately for Van Hiele level assignments and for the 33-skill annotations. These additions will directly address concerns about label reliability and allow readers to assess the robustness of the performance gains.
Revision: yes
- Referee: Results and evaluation (abstract and corresponding results section): The abstract states that skills-aware variants 'significantly outperformed baselines across multiple evaluation metrics' but supplies no numerical values, statistical tests (e.g., p-values, confidence intervals), baseline definitions, dataset split details, or error analysis. Without these, it is impossible to assess the magnitude of improvement, whether the gains are robust to different splits, or whether they survive correction for multiple comparisons.
Authors: We acknowledge that the submitted abstract and results section did not provide sufficient quantitative detail for independent evaluation. The manuscript does contain the underlying performance tables and comparisons, but we agree these must be foregrounded. In the revision we will: (i) update the abstract to report the key metric improvements (e.g., accuracy/F1 deltas) and associated p-values, (ii) add explicit definitions of the baselines, (iii) detail the train/test splits and any cross-validation procedure, (iv) include confidence intervals and error analysis, and (v) state whether multiple-comparison corrections were applied. These changes will enable readers to judge both the magnitude and statistical robustness of the reported gains.
Revision: yes
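The text does not say which significance test the authors used or plan to use. One standard option for comparing two classifiers on the same test items is an exact McNemar test on the discordant pairs, sketched below as an assumption. The function name and argument names are hypothetical.

```python
from math import comb


def mcnemar_exact(gold, pred_baseline, pred_skills):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only the baseline
    got right and c counts items only the skills-aware model got right.
    """
    b = c = 0
    for g, pb, ps in zip(gold, pred_baseline, pred_skills):
        base_ok, skills_ok = pb == g, ps == g
        if base_ok and not skills_ok:
            b += 1
        elif skills_ok and not base_ok:
            c += 1
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Because only discordant pairs enter the test, it stays valid on small test sets such as a split of 226 responses. It does not by itself handle multiple metrics, so a correction across comparisons would still be needed.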
Circularity Check
No significant circularity: standard supervised ML on external expert annotations
full rationale
The paper's derivation consists of (1) constructing a 33-item skills dictionary in collaboration with education researchers, (2) collecting 226 open-ended responses from 31 pre-service teachers, (3) having expert researchers annotate each response with a Van Hiele level plus demonstrated skills from the dictionary, and (4) training/evaluating two classifiers (RAG and MTL) that compare a skills-aware variant against a baseline without skills information. All performance claims are measured against the held-out expert annotations as ground truth. This is a conventional empirical pipeline; the target labels are produced by independent human experts rather than being defined by the model's own outputs or by any fitted parameter that is later renamed as a prediction. No equations, self-citations, or uniqueness theorems appear in the provided text that would reduce the reported outperformance to a definitional identity. The central result therefore remains falsifiable by new annotations or new data and does not collapse by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The Van Hiele model accurately characterizes geometric reasoning through five hierarchical levels.
- Domain assumption: Expert researchers can reliably annotate responses with both Van Hiele levels and the 33 specific skills.