Developing Authentic Simulated Learners for Mathematics Teacher Learning: Insights from Three Approaches with Large Language Models
Pith reviewed 2026-05-10 20:09 UTC · model grok-4.3
The pith
Three LLM methods make simulated elementary math students more authentic than few-shot prompting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All three approaches improve cognitive and linguistic authenticity of simulated students compared with few-shot prompts. Interviews reveal that the fine-tuned model produces realistic brief responses but limits opportunities to extend students' thinking, whereas the multi-agent and DPO approaches generate explicit reasoning behind student strategies.
What carries the argument
Three LLM enhancement techniques—fine-tuning on authentic student data, multi-agent collaboration, and direct preference optimization (DPO)—used to generate student-like responses to mathematics problems.
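Of the three techniques, DPO is the one with a compact, standard objective, so a small worked example may help readers unfamiliar with it. This summary does not say how the authors configured DPO. The following is only a minimal sketch of the standard per-pair DPO loss, assuming the summed log-probabilities of each response under the trained policy and a frozen reference model are already available. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not details from the paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    All names and the default beta are hypothetical; inputs are sequence
    log-probabilities (sums of token log-probs) for one preference pair.
    """
    # How much more (in log space) the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a setting like the one described here, the preferred response would presumably be the more authentic student utterance and the rejected one a less authentic generation, although how the authors built their preference pairs is not stated in this summary. The loss falls below log 2 once the policy favors the preferred response more strongly than the reference model does.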
Load-bearing premise
That the observed authenticity gains and distinct affordances from a small interview sample will produce better teacher noticing skills in actual classroom practice.
What would settle it
A controlled experiment comparing pre-service teachers' accuracy in identifying and responding to live student mathematical thinking after training with the three methods versus few-shot baselines.
Large Language Model (LLM) simulations, where LLMs act as students with varying approaches to learning tasks, can support teachers' noticing of student thinking. However, simulations using zero- or few-shot prompting often yield inauthentic knowledge and language, directing teachers to unrealistic reasoning. We evaluate three approaches (Fine-tuning, Multi-agent, and Direct Preference Optimization; DPO) to improve the authenticity and pedagogical utility of simulated students. All approaches improve cognitive and linguistic authenticity, compared with few-shot prompts. Interviews with elementary mathematics pre-service teachers and researchers (n = 8) reveal distinct pedagogical affordances. The fine-tuned model produces realistic, brief responses but limits opportunities to extend students' thinking. Meanwhile, the multi-agent and DPO approaches generate explicit reasoning behind student strategies. We discuss implications for designing LLM simulations that balance authenticity with instructional utility for teacher learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates three approaches (fine-tuning, multi-agent systems, and Direct Preference Optimization) for generating LLM-simulated elementary mathematics students. It claims that all three improve cognitive and linguistic authenticity relative to few-shot prompting baselines, and that interviews with n=8 pre-service teachers and researchers reveal distinct pedagogical affordances, with fine-tuned models yielding brief realistic responses while multi-agent and DPO approaches surface explicit student reasoning.
Significance. If substantiated, the comparative analysis of prompting and alignment techniques could guide the design of LLM simulations that better support teachers' noticing of student mathematical thinking. The explicit trade-off discussion (authenticity vs. opportunities to extend thinking) is a constructive contribution to HCI and teacher-education technology.
major comments (2)
- [§5] §5 (Evaluation of authenticity): the central claim that the three approaches improve cognitive and linguistic authenticity over few-shot prompts rests on author judgment without reported quantitative metrics, statistical tests, error bars, rubrics, blinded raters, or direct comparison to real student data. This is load-bearing for the primary contribution.
- [§6] §6 (Interview findings): the n=8 qualitative interview sample is too small to establish that the observed stylistic differences reliably produce distinct pedagogical affordances or translate into improved teacher noticing skills; no pre/post measures or external validation are described.
minor comments (1)
- [Abstract] Abstract: the phrase 'improve cognitive and linguistic authenticity' would benefit from a one-sentence gloss on the operationalization used.
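The first major comment asks for rubrics and blinded raters. One common way to report agreement between two blinded raters applying the same rubric is Cohen's kappa. The sketch below is illustrative only: the function name and the label values are hypothetical, and nothing here reproduces data or analysis from the paper.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for four simulated responses, for illustration only.
rater_1 = ["authentic", "authentic", "inauthentic", "inauthentic"]
rater_2 = ["authentic", "inauthentic", "authentic", "inauthentic"]
kappa = cohens_kappa(rater_1, rater_2)
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. With the made-up labels above, the raters agree on half the items, which is exactly what chance predicts.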
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important opportunities to strengthen the transparency and framing of our qualitative evaluations. We respond to each major comment below and indicate revisions that will be incorporated in the next version of the manuscript.
- Referee: §5 (Evaluation of authenticity): the central claim that the three approaches improve cognitive and linguistic authenticity over few-shot prompts rests on author judgment without reported quantitative metrics, statistical tests, error bars, rubrics, blinded raters, or direct comparison to real student data. This is load-bearing for the primary contribution.
  Authors: We agree that the authenticity evaluation relies on expert qualitative judgment rather than quantitative metrics or blinded raters. The analysis was grounded in established criteria from mathematics education research on student thinking and language use, with side-by-side comparisons to few-shot baselines and selected real student work samples. No statistical tests were performed because the study design was exploratory and focused on characterizing distinct output styles. In revision we will (1) explicitly describe the judgment rubric and criteria in §5, (2) add an appendix with additional annotated examples, and (3) more clearly label the evaluation as expert qualitative comparison rather than a controlled quantitative study. Direct large-scale comparison to real student corpora remains a valuable direction for future work but was outside the scope of the current paper.
  Revision: partial
- Referee: §6 (Interview findings): the n=8 qualitative interview sample is too small to establish that the observed stylistic differences reliably produce distinct pedagogical affordances or translate into improved teacher noticing skills; no pre/post measures or external validation are described.
  Authors: The n=8 sample is small and the study is intentionally exploratory; we do not claim statistical generalizability or causal improvement in noticing skills. The interviews were designed to surface participant perceptions of pedagogical utility and to identify trade-offs (e.g., brevity vs. explicit reasoning) that can inform future simulation design. Consistent themes emerged across the eight participants, supporting the reported distinctions. We will revise §6 and the discussion to (1) explicitly frame the findings as preliminary insights from a small expert sample, (2) acknowledge the absence of pre/post or validated noticing measures, and (3) outline concrete directions for larger-scale validation studies. This framing aligns with common practice in HCI and teacher-education technology research when introducing novel simulation approaches.
  Revision: partial
Circularity Check
No circularity: empirical comparison of LLM simulation approaches
This is a purely empirical study that evaluates three LLM-based simulation approaches (fine-tuning, multi-agent, DPO) against few-shot baselines via side-by-side output comparisons and n=8 interviews. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described methods. Authenticity improvements are reported from direct comparisons and qualitative feedback rather than being constructed from the inputs by definition. The central claims rest on experimental outcomes and participant observations, which do not depend on any self-citation chain or on assumptions introduced without justification.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Cognitive and linguistic authenticity of simulated student responses can be meaningfully improved and assessed relative to few-shot baselines
- ad hoc to paper: Interviews with n=8 pre-service teachers and researchers are adequate to reveal distinct pedagogical affordances of each simulation approach