arxiv: 2604.04361 · v1 · submitted 2026-04-06 · 💻 cs.HC

Developing Authentic Simulated Learners for Mathematics Teacher Learning: Insights from Three Approaches with Large Language Models

Pith reviewed 2026-05-10 20:09 UTC · model grok-4.3

classification 💻 cs.HC
keywords LLM simulations · mathematics teacher education · simulated learners · authenticity · fine-tuning · multi-agent · direct preference optimization · noticing student thinking

The pith

Three LLM methods make simulated elementary math students more authentic than few-shot prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates fine-tuning, multi-agent systems, and direct preference optimization to generate LLM responses that better match real elementary students' thinking and language during math tasks. This matters because inauthentic simulations can steer teachers toward unrealistic expectations about how students reason. All three methods raise cognitive and linguistic authenticity over basic prompting. Interviews with eight pre-service teachers and researchers show the fine-tuned model yields brief, realistic answers that limit follow-up, while multi-agent and DPO versions surface the reasoning behind student strategies.

Core claim

All three approaches improve cognitive and linguistic authenticity of simulated students compared with few-shot prompts. Interviews reveal that the fine-tuned model produces realistic brief responses but limits opportunities to extend students' thinking, whereas the multi-agent and DPO approaches generate explicit reasoning behind student strategies.

What carries the argument

Three LLM enhancement techniques—fine-tuning on authentic student data, multi-agent collaboration, and direct preference optimization (DPO)—used to generate student-like responses to mathematics problems.
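
For readers unfamiliar with DPO, the per-pair objective from Rafailov et al. [17] can be sketched in a few lines. This is a minimal illustration, not the paper's training code: the function name `dpo_loss`, the `beta` default of 0.1, and the framing of the "chosen" response as the more student-like one are assumptions made here for illustration.

```python
import math


def _softplus(x):
    # log(1 + exp(x)), written so that large |x| does not overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    Assumption: "chosen" is the response judged more authentic,
    "rejected" the less authentic one. All inputs are log-probabilities
    of the full response under the trainable policy or the frozen
    reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return _softplus(-logits)
```

When the policy still equals the reference model, both margins are zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the rejected one.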

Load-bearing premise

That the observed authenticity gains and distinct affordances from a small interview sample will produce better teacher noticing skills in actual classroom practice.

What would settle it

A controlled experiment comparing pre-service teachers' accuracy in identifying and responding to live student mathematical thinking after training with the three methods versus few-shot baselines.
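
One way such a comparison could be analyzed is a permutation test on the difference in mean noticing accuracy between a simulation-trained group and a few-shot-baseline group. The sketch below is an assumption about a possible analysis, not something the paper reports; the function name `permutation_test` and the idea of one accuracy score per teacher are hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(treated, baseline, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    `treated` and `baseline` are sequences of per-teacher scores
    (hypothetical here). Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = _mean(treated) - _mean(baseline)
    pooled = list(treated) + list(baseline)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = _mean(pooled[:n_treated]) - _mean(pooled[n_treated:])
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value
```

A permutation test makes no distributional assumption about the scores, which suits the small group sizes typical of teacher-education studies.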

Figures

Figures reproduced from arXiv: 2604.04361 by Boran Yu, Dionne Cross Francis, Ha Nguyen, Jie Cao, Pavneet Kaur Bharaj, Selim Yavuz, Shuguang Wang.

Figure 1: Prompt and interface for the simulated agent (Josh)
Figure 2: Overview of the approaches. LLM selection considering commercial (e.g., GPT) and open-source LLMs (Llama, Mistral) from testing-feedback in Summer-Fall 2025.
Multi-agent. The Multi-agent approach involved decomposing a main objective into specialized tasks for distinct collaborating agents using gpt-4o [2,10]. We designed three agents (see Appendix B for prompts). An Initial Responder outputted the respon…
Figure 3: Overall performance of the three approaches: Cognition and Language
4.2 RQ2: Educators’ Feedback: Authenticity & Pedagogical Utility. Participants preferred the DPO version (…
Figure 4: Participants’ rankings for preference and authenticity
The interview results suggested that participants perceived responses from all approaches as authentic but in qualitatively different ways. Natural interactions. Most participants evaluated that the Fine-tuning approach produced shorter responses that mirrored classroom instances when a student “isn’t super into it” (Alice, PST) or “is just waiting fo…

Original abstract

Large Language Model (LLM) simulations, where LLMs act as students with varying approaches to learning tasks, can support teachers' noticing of student thinking. However, simulations using zero- or few-shot prompting often yield inauthentic knowledge and language, directing teachers to unrealistic reasoning. We evaluate three approaches (Fine-tuning, Multi-agent, and Direct Preference Optimization; DPO) to improve the authenticity and pedagogical utility of simulated students. All approaches improve cognitive and linguistic authenticity, compared with few-shot prompts. Interviews with elementary mathematics pre-service teachers and researchers (n = 8) reveal distinct pedagogical affordances. The fine-tuned model produces realistic, brief responses but limits opportunities to extend students' thinking. Meanwhile, the multi-agent and DPO approaches generate explicit reasoning behind student strategies. We discuss implications for designing LLM simulations that balance authenticity with instructional utility for teacher learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates three approaches (fine-tuning, multi-agent systems, and Direct Preference Optimization) for generating LLM-simulated elementary mathematics students. It claims that all three improve cognitive and linguistic authenticity relative to few-shot prompting baselines, and that interviews with n=8 pre-service teachers and researchers reveal distinct pedagogical affordances, with fine-tuned models yielding brief realistic responses while multi-agent and DPO approaches surface explicit student reasoning.

Significance. If substantiated, the comparative analysis of prompting and alignment techniques could guide the design of LLM simulations that better support teachers' noticing of student mathematical thinking. The explicit trade-off discussion (authenticity vs. opportunities to extend thinking) is a constructive contribution to HCI and teacher-education technology.

major comments (2)
  1. §5 (Evaluation of authenticity): the central claim that the three approaches improve cognitive and linguistic authenticity over few-shot prompts rests on author judgment without reported quantitative metrics, statistical tests, error bars, rubrics, blinded raters, or direct comparison to real student data. This is load-bearing for the primary contribution.
  2. §6 (Interview findings): the n=8 qualitative interview sample is too small to establish that the observed stylistic differences reliably produce distinct pedagogical affordances or translate into improved teacher noticing skills; no pre/post measures or external validation are described.
minor comments (1)
  1. Abstract: the phrase 'improve cognitive and linguistic authenticity' would benefit from a one-sentence gloss on the operationalization used.
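
If a revised §5 were to use independent blinded raters, as the first major comment suggests, their agreement is commonly summarized with Cohen's kappa. The sketch below shows the standard two-rater computation; the function name `cohen_kappa` and any label values are illustrative assumptions, and the paper does not report such a statistic.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    `rater_a` and `rater_b` are equal-length sequences of category
    labels, for example one authenticity code per simulated response.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # formally undefined, and 1.0 is the convention chosen here.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw percent agreement for chance, so a value near 0 means the raters agree no more often than their label frequencies alone would predict.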

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important opportunities to strengthen the transparency and framing of our qualitative evaluations. We respond to each major comment below and indicate revisions that will be incorporated in the next version of the manuscript.

point-by-point responses
  1. Referee: §5 (Evaluation of authenticity): the central claim that the three approaches improve cognitive and linguistic authenticity over few-shot prompts rests on author judgment without reported quantitative metrics, statistical tests, error bars, rubrics, blinded raters, or direct comparison to real student data. This is load-bearing for the primary contribution.

    Authors: We agree that the authenticity evaluation relies on expert qualitative judgment rather than quantitative metrics or blinded raters. The analysis was grounded in established criteria from mathematics education research on student thinking and language use, with side-by-side comparisons to few-shot baselines and selected real student work samples. No statistical tests were performed because the study design was exploratory and focused on characterizing distinct output styles. In revision we will (1) explicitly describe the judgment rubric and criteria in §5, (2) add an appendix with additional annotated examples, and (3) more clearly label the evaluation as expert qualitative comparison rather than a controlled quantitative study. Direct large-scale comparison to real student corpora remains a valuable direction for future work but was outside the scope of the current paper.
    revision: partial

  2. Referee: §6 (Interview findings): the n=8 qualitative interview sample is too small to establish that the observed stylistic differences reliably produce distinct pedagogical affordances or translate into improved teacher noticing skills; no pre/post measures or external validation are described.

    Authors: The n=8 sample is small and the study is intentionally exploratory; we do not claim statistical generalizability or causal improvement in noticing skills. The interviews were designed to surface participant perceptions of pedagogical utility and to identify trade-offs (e.g., brevity vs. explicit reasoning) that can inform future simulation design. Consistent themes emerged across the eight participants, supporting the reported distinctions. We will revise §6 and the discussion to (1) explicitly frame the findings as preliminary insights from a small expert sample, (2) acknowledge the absence of pre/post or validated noticing measures, and (3) outline concrete directions for larger-scale validation studies. This framing aligns with common practice in HCI and teacher-education technology research when introducing novel simulation approaches.
    revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison of LLM simulation approaches

This is a purely empirical study that evaluates three LLM-based simulation approaches (fine-tuning, multi-agent, DPO) against few-shot baselines via prompting experiments and n=8 interviews. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described methods. Authenticity improvements are reported from direct comparisons and qualitative feedback rather than being constructed from the inputs by definition. The central claims rest on experimental outcomes and participant observations, which are independent of any self-citation chain or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on domain assumptions about measurable authenticity in LLM outputs and the sufficiency of small-scale qualitative data for identifying pedagogical differences; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption: Cognitive and linguistic authenticity of simulated student responses can be meaningfully improved and assessed relative to few-shot baselines.
    Invoked to support the claim that all three approaches outperform few-shot prompting.
  • ad hoc to paper: Interviews with n=8 pre-service teachers and researchers are adequate to reveal distinct pedagogical affordances of each simulation approach.
    Used to differentiate the practical utilities of the fine-tuned, multi-agent, and DPO models.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. Barrett, A., Ke, F., Zhang, N., Dai, C.P., Bhowmik, S., Yuan, X.: Pattern analysis of ambitious science talk between preservice teachers and ai-powered student agents. In: LAK ’25. pp. 761–770. Association for Computing Machinery (2025)
  2. Cao, J., Zhao, C.Q., Chen, X., Wang, S., Schunn, C., Koedinger, K.R., Lin, J.: From first draft to final insight: a multi-agent approach for feedback generation. In: AIED’25. pp. 163–176. Springer (2025)
  3. Cohen, J., Wong, V., Krishnamachari, A., Berlin, R.: Teacher coaching in a simulated environment. Educational evaluation and policy analysis 42(2), 208–231 (2020)
  4. Grossman, P., Compton, C., Igra, D., Ronfeldt, M., Shahan, E., Williamson, P.W.: Teaching practice: A cross-professional perspective. Teach. Coll. Rec. 111(9), 2055–2100 (2009)
  5. Jacobs, V.R., Lamb, L.L., Philipp, R.A.: Professional noticing of children’s mathematical thinking. J. Res. Math. Educ. 41(2), 169–202 (2010)
  6. Jin, H., Yoo, M., Park, J., Lee, Y., Wang, X., Kim, J.: Teachtune: Reviewing pedagogical agents against diverse student profiles with simulated students. In: CHI’25. pp. 1–28 (2025)
  7. Kilic, H., Dogan, O.: Preservice mathematics teachers’ noticing in action and in reflection. International Journal of Science and Mathematics Education 20(2), 345–366 (2022)
  8. Lampert, M., Franke, M.L., Kazemi, E., Ghousseini, H., Turrou, A.C., Beasley, H., Cunard, A., Crowe, K.: Keeping it complex: Using rehearsals to support novice teacher learning of ambitious teaching. J. Teach. Educ. 64(3), 226–243 (2013)
  9. Lee, T.D., Lee, C., Newton, M., Vos, P., Gallagher, J., Dickerson, D., Regenthal, C.: Peer to peer vs. virtual rehearsal simulation rehearsal contexts: Elementary teacher candidates’ scientific discourse skills explored. J. Sci. Teach. Educ. 35(1), 63–84 (2024)
  10. Li, X., Wang, S., Zeng, S., Wu, Y., Yang, Y.: A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 1(1), 9 (2024)
  11. Liu, N., Sonkar, S., Baraniuk, R.: Do llms make mistakes like students? exploring natural alignments between language models and human error patterns. In: AIED’25. pp. 364–377. Springer (2025)
  12. MacNeil, S., Rogalska, M., Leinonen, J., Denny, P., Hellas, A., Crosland, X.: Synthetic students: A comparative study of bug distribution between large language models and computing students. In: Proc. CompEd 2024. pp. 137–143 (2024)
  13. Martynova, D., Macina, J., Daheim, N., Yalcin, N., Zhang, X., Sachan, M.: Can llms effectively simulate human learners? teachers’ insights from tutoring llm students. In: Proc. BEA 2025. pp. 100–117 (2025)
  14. Mikeska, J.N., Francis, D.C., Lottero-Perdue, P.S., Rogers, M.P., Shekell, C., Bharaj, P.K., Howell, H., Maltese, A., Thompson, M., Reich, J.: Promoting pre-service teachers’ facilitation of argumentation in mathematics and science through digital simulations. Teach. Teach. Educ. 154, 104858 (2025)
  15. Miller, P., Dicerbo, K.: Llm based math tutoring: Challenges and dataset. EdArXiv Preprints (2024), https://osf.io/preprints/edarxiv/5zwv3_v1
  16. Pan, S., Schmucker, R., Garcia Bulle Bueno, B., Llanes, S.A., Albo Alarcón, F., Zhu, H., Teo, A., Xia, M.: Tutorup: What if your students were simulated? training tutors to address engagement challenges in online learning. In: CHI’25. pp. 1–18 (2025)
  17. Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Adv. Neural Inf. Process. Syst. 36, 53728–53741 (2023)
  18. Richter, E., Hußner, I., Huang, Y., Richter, D., Lazarides, R.: Video-based reflection in teacher education: Comparing virtual reality and real classroom videos. Comput. Educ. 190, 104601 (2022)
  19. Scarlatos, A., Fernandez, N., Ormerod, C., Lottridge, S., Lan, A.: Smart: Simulated students aligned with item response theory for question difficulty prediction. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 25082–25105 (2025)
  20. Scarlatos, A., Smith, D., Woodhead, S., Lan, A.: Improving the validity of automatically generated feedback via reinforcement learning. In: AIED’24. pp. 280–294. Springer (2024)
  21. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 36, 8634–8652 (2023)
  22. Suresh, A., Jacobs, J., Harty, C., Perkoff, M., Martin, J.H., Sumner, T.: The talkmoves dataset: K-12 mathematics lesson transcripts annotated for teacher and student discursive moves. In: Proceedings of the thirteenth language resources and evaluation conference. pp. 4654–4662 (2022)
  23. Van Es, E.A., Sherin, M.G.: Expanding on prior conceptualizations of teacher noticing. ZDM Math. Educ. 53(1), 17–27 (2021)
  24. Xu, S., Wen, H.N., Pan, H., Dominguez, D., Hu, D., Zhang, X.: Classroom simulacra: Building contextual student generative agents in online education for learning behavioral simulation. In: CHI’25. pp. 1–26 (2025)
  25. Zhang, N., Ke, F., Dai, C.P., Southerland, S.A., Yuan, X.: Seeking to support pre-service teachers’ responsive teaching: Leveraging artificial intelligence-supported virtual simulation. Br. J. Educ. Technol. 56(3), 1148–1169 (2025)
  26. Zheng, L., He, A., Qi, C., Zhang, H., Gu, X.: Cognitive echo: Enhancing think-aloud protocols with llm-based simulated students. Br. J. Educ. Technol. (2025)
  27. Zheng, L., Jiang, F., Gu, X., Li, Y., Wang, G., Zhang, H.: Teaching via llm-enhanced simulations: Authenticity and barriers to suspension of disbelief. Internet High. Educ. 65, 100990 (2025)