AI-Driven Assessment of Human Tutors: Linking Training Performance to Real-Life Practice

Clara Brandt; Conrad Borchers; Danielle R. Thomas; Kenneth R. Koedinger; Marie Cynthia Abijuru Kamikazi

arxiv: 2606.18617 · v1 · pith:5E3NQXMEnew · submitted 2026-06-17 · 💻 cs.CY · cs.AI

Danielle R. Thomas, Marie Cynthia Abijuru Kamikazi, Clara Brandt, Conrad Borchers, Kenneth R. Koedinger

Pith reviewed 2026-06-26 19:34 UTC · model grok-4.3

classification 💻 cs.CY cs.AI

keywords AI assessment · tutor training · real-life practice · mixed-effects models · pedagogical skills · transcript analysis · learning gains · generative AI

The pith

Tutor training performance predicts real-life tutoring transcript scores with a 0.25 SD effect size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that scores from AI-evaluated training scenarios reliably forecast how tutors actually perform when working with real students. This connection matters because most training platforms stop at simulated practice and never check whether skills transfer to live sessions. Data from 86 tutors across 405 session-to-lesson pairs, analyzed with mixed-effects models, establish the predictive link. The work also reports that tutors recognize more pedagogical opportunities and execute skills at higher quality after training, though these changes appear gradual rather than sudden.

Core claim

An AI system using Gemini-2.5-pro scores both open responses in scenario-based training lessons and transcriptions of authentic remote math tutoring. Across 405 pairs, training performance significantly predicts real-life transcript scores at an effect size of 0.25 SD. Averaging open-response and multiple-choice scores during training provides the best prediction of real performance, though open responses alone are more predictive than multiple choice. Tutors achieve a 7.4 percent average learning gain, encounter pedagogical opportunities more often (61.1 percent to 68.9 percent), and show higher execution quality within those opportunities (65.5 percent to 68.1 percent), with changes follow

What carries the argument

The AI-driven scoring system that applies Gemini-2.5-pro with fixed prompts and rubrics to rate training responses and real tutoring transcripts, then links those ratings through mixed-effects models.

If this is right

Averaged training scores serve as the strongest single indicator of later real-life tutoring quality.
Open-response items during training capture more transferable skill information than multiple-choice items alone.
Post-training gains appear as increased frequency and better execution of pedagogical moves rather than an immediate jump.
Open release of datasets, prompts, and rubrics enables direct replication and refinement of the linking method.
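
The contrast in the third point, a gradual trend versus an immediate jump, can be made concrete with the textbook segmented-regression form of an interrupted time series. This is a generic sketch, not the authors' reported specification; the symbols $t$, $T_0$, $D_t$, and the $\beta$ coefficients are assumptions introduced here for illustration.

```latex
y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - T_0) D_t + \varepsilon_t,
\qquad
D_t =
\begin{cases}
  0 & \text{if } t < T_0 \\
  1 & \text{if } t \ge T_0
\end{cases}
```

Here $y_t$ stands for an outcome at time $t$ (for example, opportunity rate or execution quality), $T_0$ is the point at which training occurs, $\beta_1$ is the pre-existing slope, $\beta_2$ is the immediate level change at $T_0$, and $\beta_3$ is the change in slope afterwards. Under this reading, "gradual rather than immediate" would correspond to a positive $\beta_1$ with $\beta_2$ indistinguishable from zero.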

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Programs could use the same AI pipeline for ongoing monitoring of tutor cohorts instead of periodic human observation.
The approach might extend to other domains where short training exercises must predict performance in live, variable settings.
Because the prediction holds across many session-to-lesson pairs, training platforms could triage which tutors need extra practice before live work.

Load-bearing premise

The AI model with the supplied prompts and rubrics produces scores that accurately reflect true pedagogical skill quality without systematic bias in either training or real transcripts.

What would settle it

Independent human raters scoring the same real-life transcripts find no correlation with the AI-generated scores, or training performance shows no predictive relation to transcript quality in a fresh sample of tutors.
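
The first test described above amounts to correlating two score vectors for the same transcripts. A minimal sketch in Python, assuming per-transcript scores are available as two equal-length lists; the function name `pearson_r` and the example values are hypothetical and are not drawn from the paper's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same six transcripts (illustrative only).
human_scores = [0.55, 0.70, 0.40, 0.85, 0.60, 0.75]
ai_scores = [0.60, 0.65, 0.45, 0.90, 0.50, 0.80]

r = pearson_r(human_scores, ai_scores)
print(f"human-AI correlation: {r:.2f}")
```

A correlation near zero on a real validation sample would be the disconfirming outcome the paragraph describes. Correlation alone does not rule out a constant offset between raters, so it would typically be read alongside an agreement statistic.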

Figures

Figures reproduced from arXiv: 2606.18617 by Clara Brandt, Conrad Borchers, Danielle R. Thomas, Kenneth R. Koedinger, Marie Cynthia Abijuru Kamikazi.

**Figure 2.** Interrupted-time series research design.

**Figure 3.** Pretest-posttest instructional design.

**Figure 4.** Linear relationship between lesson performance and assessment of real-life

Original abstract

There exist numerous tutor training platforms. However, few provide AI-driven training and evaluation for human tutors based on real-life performance. We present an AI-driven system that assesses both open responses during training and authentic real-life tutoring. Unlike platforms that only assess learning through online training or simulations, our system utilizes Generative AI (Gemini-2.5-pro) to analyze transcriptions of authentic tutoring, measuring the transfer of tutor skills to real-life application. Human tutors instructing students remotely in math (N=86) completed six scenario-based lessons, averaging a significant 7.4% learning gain. Using mixed-effects models across 405 session-to-lesson pairs, we found that training performance significantly predicted real-life transcript scores with an effect size of 0.25 SD. Model comparison (AIC/BIC) indicated averaging open response and multiple choice performance during training predicted real-life tutor performance best, although open responses were comparatively more predictive. Exploratory analysis showed that after training, tutors were significantly more likely to encounter pedagogical opportunities to apply their skills (61.1% to 68.9%) and demonstrated higher execution quality within those opportunities (65.5% to 68.1%). Interrupted time series analysis suggested that these tutor improvements were part of a gradual trend over time rather than an immediate intervention effect of training. We illustrate an AI-driven method to link tutor training with real-life assessment. In doing so, we contribute open datasets, AI prompts, and scoring rubrics to support transparency and reproducibility.
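
The AIC/BIC comparison mentioned in the abstract can be illustrated with a small Python sketch. It is deliberately simplified: it fits plain least-squares lines rather than the mixed-effects models the paper uses, and the three score lists are invented for illustration. The helper names (`aic`, `bic`, `ols_log_likelihood`) are assumptions of this sketch, not part of the authors' released code.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: penalizes parameters by log(n)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def ols_log_likelihood(x, y):
    """Maximized Gaussian log-likelihood of the line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / n  # maximum-likelihood estimate of the residual variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


# Invented scores for eight tutors (illustrative only, not the paper's data).
open_resp = [0.40, 0.60, 0.50, 0.80, 0.70, 0.90, 0.30, 0.60]
mult_choice = [0.50, 0.70, 0.80, 0.60, 0.90, 0.80, 0.60, 0.50]
transcript = [0.45, 0.62, 0.60, 0.70, 0.78, 0.85, 0.42, 0.55]
averaged = [(o + m) / 2 for o, m in zip(open_resp, mult_choice)]

candidates = {
    "open response": open_resp,
    "multiple choice": mult_choice,
    "average": averaged,
}
N_PARAMS = 3  # intercept, slope, residual variance

scores = {}
for name, predictor in candidates.items():
    ll = ols_log_likelihood(predictor, transcript)
    scores[name] = (aic(ll, N_PARAMS), bic(ll, N_PARAMS, len(transcript)))

best = min(scores, key=lambda name: scores[name][0])
for name, (a, b) in scores.items():
    print(f"{name:16s} AIC={a:8.2f} BIC={b:8.2f}")
print("lowest AIC:", best)
```

Because all three candidates here have the same number of parameters, AIC and BIC rank them identically and the comparison reduces to comparing likelihoods. The two criteria can disagree only when candidate models differ in complexity.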

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: a private letter to a colleague

The paper shows a modest 0.25 SD link from AI-scored training to real tutoring transcripts but rests on unvalidated Gemini-2.5-pro outputs for both sides.

The main thing to know is that this work scores both training open responses and real tutoring transcripts with Gemini-2.5-pro, then uses mixed-effects models on 405 pairs to find that training performance predicts real-life scores at 0.25 SD. They also report tutors improved in encountering and handling pedagogical opportunities over time, though the interrupted time series points to a gradual trend rather than an immediate training jump. The 7.4% learning gain in the six scenario lessons is noted but secondary.

What the paper does reasonably well is apply an off-the-shelf generative model to link training data with authentic remote math tutoring transcripts from 86 tutors, and it contributes the datasets, prompts, and rubrics openly. That resource contribution is useful for anyone trying to replicate or extend this kind of transfer measurement.

The soft spot is the AI scoring step itself. No human-AI agreement, inter-rater reliability, or validation subset is mentioned in the abstract, so the 0.25 effect could simply reflect how the model applies the rubric rather than genuine skill transfer. The model comparison favoring averaged open-response and multiple-choice scores is fine as far as it goes, but without scoring validation the central claim stays provisional. The circularity concern does not appear to apply here since the real-life scores are independent.

This is for researchers working on AI-assisted tutor training and assessment in education. A reader who wants concrete numbers on training-to-practice links and open materials will find something usable. It has enough empirical grounding and data contribution to deserve a serious referee, mainly to check the scoring validation and model specs in the full methods.

Referee Report

2 major / 2 minor

Summary. The paper presents an AI-driven assessment system that uses Gemini-2.5-pro to score open-response answers during six scenario-based tutor training lessons and to score authentic real-life tutoring transcripts. With N=86 tutors and 405 session-to-lesson pairs, it reports a 7.4% learning gain from training and, via mixed-effects models, a statistically significant prediction from training performance to real-life transcript scores (effect size 0.25 SD). Model comparison favors averaging open-response and multiple-choice training scores; exploratory interrupted time-series analyses indicate gradual increases in both the frequency of pedagogical opportunities (61.1% to 68.9%) and execution quality within them (65.5% to 68.1%). The manuscript contributes open datasets, prompts, and rubrics.

Significance. If the AI scoring is shown to be valid, the work supplies direct evidence that training performance transfers to real tutoring practice and offers a scalable, reproducible pipeline for linking the two. The explicit release of datasets, prompts, and rubrics is a concrete strength that supports transparency and future replication.

major comments (2)

[Abstract / Methods] Abstract and Methods: the 0.25 SD effect size and all downstream claims rest entirely on Gemini-2.5-pro scores for both training responses and real-life transcripts, yet no human-AI agreement statistics, inter-rater reliability on a validation subset, or rubric-bias checks are reported. This is load-bearing for the central prediction claim.
[Results] Results (mixed-effects models): the model specifications (random-effects structure, covariates, handling of the 405 pairs, and any data-exclusion rules) are not described, preventing assessment of whether the reported effect size is robust or sensitive to analytic choices.

minor comments (2)

[Abstract] The abstract states that 'averaging open response and multiple choice performance' was optimal but does not specify the exact weighting or aggregation procedure.
[Methods] Clarify how the 405 session-to-lesson pairs were constructed from the N=86 tutors and six lessons (e.g., whether all tutors contributed equally).

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to improve transparency and validity reporting.

Referee: [Abstract / Methods] Abstract and Methods: the 0.25 SD effect size and all downstream claims rest entirely on Gemini-2.5-pro scores for both training responses and real-life transcripts, yet no human-AI agreement statistics, inter-rater reliability on a validation subset, or rubric-bias checks are reported. This is load-bearing for the central prediction claim.

Authors: We agree that human validation of the AI scores is essential to support the central claims. The initial submission omitted these statistics. In the revision we will add a new subsection to Methods that describes a human-rated validation subset (approximately 20% of responses stratified by lesson and tutor), reports agreement metrics (e.g., Cohen’s κ and percentage agreement) between Gemini-2.5-pro and human raters, human inter-rater reliability, and any systematic rubric-bias checks. These additions will directly address the load-bearing concern. revision: yes
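
The agreement metrics named in this response can be computed in a few lines. The following is a minimal sketch, assuming the rubric yields one categorical label per response from each rater; the function names and the ten example labels are hypothetical and do not come from the paper's validation data.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels (1 = skill demonstrated, 0 = not) for ten responses.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
ai_labels = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

print(f"agreement = {percent_agreement(human_labels, ai_labels):.2f}")  # 0.80
print(f"kappa     = {cohens_kappa(human_labels, ai_labels):.2f}")       # 0.58
```

The example shows why both numbers are worth reporting: 80% raw agreement drops to a kappa of about 0.58 once the agreement expected by chance (52% here) is removed.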
Referee: [Results] Results (mixed-effects models): the model specifications (random-effects structure, covariates, handling of the 405 pairs, and any data-exclusion rules) are not described, preventing assessment of whether the reported effect size is robust or sensitive to analytic choices.

Authors: We acknowledge the need for full model transparency. The models were fit with lmerTest in R using random intercepts for tutors and for the 405 session-to-lesson pairs; fixed effects included training performance (averaged open-response and multiple-choice scores), tutor experience, and lesson order. No observations were excluded beyond those with missing transcript data. In the revision we will expand the Methods section with the complete model equations, random-effects structure, covariate list, data-handling rules, and results of sensitivity checks (alternative random-effects specifications and covariate sets) to allow readers to evaluate robustness. revision: yes
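
The specification described in this response can be written out as a single equation. This is a sketch reconstructed from the prose above, not an equation taken from the manuscript; the symbols ($y_{ip}$, $u_i$, $v_p$, the $\beta$ coefficients) and the indexing of tutors by $i$ and session-to-lesson pairs by $p$ are assumptions.

```latex
y_{ip} = \beta_0
       + \beta_1\,\mathrm{Training}_{ip}
       + \beta_2\,\mathrm{Experience}_{i}
       + \beta_3\,\mathrm{LessonOrder}_{ip}
       + u_i + v_p + \varepsilon_{ip},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_p \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ip} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ip}$ is the AI-scored transcript quality, $\mathrm{Training}_{ip}$ is the averaged open-response and multiple-choice score, $u_i$ is the tutor random intercept, and $v_p$ is the pair random intercept. One point the promised sensitivity checks would need to cover: if each pair contributes exactly one row, $v_p$ and $\varepsilon_{ip}$ cannot be estimated separately, so the tutor-only random-intercept model is the natural alternative to report.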

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claim applies standard mixed-effects regression to 405 independent session-to-lesson pairs, predicting AI-scored real-life transcript quality from separately AI-scored training performance. Training open responses and real transcripts are distinct data sources; neither is defined in terms of the other, and no equation or model fit reduces the reported 0.25 SD effect to a tautology or to a fitted parameter renamed as prediction. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to justify the link. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unvalidated accuracy of Gemini-2.5-pro transcript scoring and the representativeness of the 86 tutors and 405 session pairs; standard statistical assumptions for mixed-effects models are also required but not detailed.

axioms (1)

[standard math] Mixed-effects models assume normally distributed residuals and random effects. Invoked implicitly by the use of mixed-effects models on the 405 pairs.

