Does the TalkMoves Codebook Generalize to One-on-One Tutoring and Multimodal Interaction?
Pith reviewed 2026-05-10 13:27 UTC · model grok-4.3
The pith
TalkMoves codebook shows uneven generalizability to one-on-one tutoring and multimodal interactions
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This study finds that the human-developed TalkMoves codebook achieves higher inter-rater reliability (κ = 0.74) than the AI-human codebook (κ = 0.64) when annotating one-on-one tutoring sessions, yet the AI-human codebook offers broader empirical coverage and higher perceived usability across chat, audio, and multimodal modalities. Both codebooks fail to capture all tutoring-relevant moves and create ambiguity for actions expressed nonverbally or through multiple channels.
What carries the argument
Direct comparison of the TalkMoves codebook and a hybrid AI-human codebook on six tutoring sessions across three modalities by two expert annotators, measuring reliability, coverage, and usability.
Load-bearing premise
That six tutoring sessions, annotated by only two experts, are enough to support conclusions about how well the codebook generalizes to one-on-one tutoring in general and to all interaction modalities.
What would settle it
Replicating the annotation process with a larger number of sessions from additional tutoring platforms or subject areas and checking if the patterns of undercapture and ambiguity remain the same.
Original abstract
Accountable Talk theory has been widely adopted to analyze classroom discourse and is increasingly used to annotate tutoring interactions. In particular, the TalkMoves codebook, grounded in Accountable Talk theory, is commonly used to label tutoring data and train models of effective instructional support. However, Accountable Talk was originally developed to characterize collaborative, whole-classroom oral discourse, not to identify talk moves in one-on-one tutoring environments using multimodal data (e.g., video, audio, chat). As tutoring platforms expand in scale and modality, questions remain about whether Accountable Talk-based codebooks generalize reliably beyond their original classroom context and data representation. This study examines whether the human-developed TalkMoves codebook generalizes in reliability, utility, and interpretability when applied to one-on-one tutoring across audio, chat, and multimodal data. We compare TalkMoves with a hybrid AI-human developed codebook using a workflow established in prior research. Two expert annotators with over 20 years of teaching experience applied both codebooks to six tutoring sessions spanning three modalities: chat-based, audio-only, and multimodal interactions. Results show that while TalkMoves achieved higher overall inter-rater reliability than the AI-human codebook (κ = 0.74 vs. 0.64), the AI-human codebook demonstrated broader empirical coverage and higher perceived usability across modalities. Both codebooks undercaptured tutoring-relevant moves and introduced ambiguity when identifying actions expressed through nonverbal and multimodal artifacts. Together, these findings highlight the uneven generalizability of TalkMoves to tutoring contexts and motivate the development of modality-aware, tutoring-grounded codebooks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines whether the TalkMoves codebook, grounded in Accountable Talk theory for classroom discourse, generalizes to one-on-one tutoring interactions across chat-based, audio-only, and multimodal modalities. The authors compare TalkMoves to a hybrid AI-human codebook by having two expert annotators (with over 20 years teaching experience) apply both to six tutoring sessions (two per modality), reporting higher inter-rater reliability for TalkMoves (κ=0.74 vs. 0.64) but broader empirical coverage and higher usability for the AI-human codebook; both undercapture tutoring-relevant moves and introduce ambiguity with nonverbal/multimodal artifacts.
Significance. If the core empirical observations hold, the work usefully documents concrete limitations of classroom-derived discourse codebooks when applied to tutoring, particularly in multimodal settings, and supplies specific kappa values plus qualitative usability notes from experienced educators to motivate modality-aware alternatives. The direct side-by-side comparison and identification of undercaptured moves constitute a clear contribution to HCI and learning-sciences annotation research.
major comments (1)
- [Methods (description of the six tutoring sessions and annotation procedure) and Results (kappa values and qualitative comparisons).] The central claim of uneven generalizability (higher IRR but narrower coverage and lower usability for TalkMoves, plus shared struggles with multimodal artifacts) rests on annotations from only six tutoring sessions (two per modality) by two annotators. This sample size is load-bearing for the generalization inferences yet provides insufficient statistical power or diversity to separate systematic codebook limitations from session-specific idiosyncrasies or annotator bias; no additional statistical tests beyond the reported kappas are described to support the coverage and usability comparisons.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the opportunity to clarify aspects of our work. We address the major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: The central claim of uneven generalizability (higher IRR but narrower coverage and lower usability for TalkMoves, plus shared struggles with multimodal artifacts) rests on annotations from only six tutoring sessions (two per modality) by two annotators. This sample size is load-bearing for the generalization inferences yet provides insufficient statistical power or diversity to separate systematic codebook limitations from session-specific idiosyncrasies or annotator bias; no additional statistical tests beyond the reported kappas are described to support the coverage and usability comparisons.
Authors: We agree that the small sample (six sessions total) limits the strength of generalization claims and that the study would benefit from larger-scale validation in future work. The design was intentionally focused to enable detailed qualitative comparison by two highly experienced annotators across modalities, yielding both the reported kappas and the usability/coverage observations from their notes and discussions. Coverage and usability were assessed qualitatively rather than through additional quantitative tests. In revision we will add an explicit limitations subsection that qualifies the preliminary nature of the inferences, discusses risks of session-specific effects and annotator bias, and outlines the need for expanded datasets. We will also provide further detail on the annotation workflow and how qualitative insights were derived.
Revision: partial
- Expanding the dataset with additional sessions and annotators to increase statistical power and diversity is not feasible within the scope and timeline of the current revision.
Circularity Check
No circularity: purely empirical annotation study with no derivations or self-referential reductions
Full rationale
This paper conducts a direct empirical comparison by having two expert annotators apply the TalkMoves codebook and an AI-human codebook to six tutoring sessions across modalities, then reports Cohen's kappa values, coverage observations, and usability judgments. No equations, fitted parameters, predictions, or first-principles derivations appear anywhere in the manuscript. The single reference to a 'workflow established in prior research' is a methodological citation for the annotation process and does not define or force the reported reliability, coverage, or generalizability conclusions. All claims rest on observable annotation data external to any self-referential loop, satisfying the criteria for a self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Cohen's kappa is an appropriate and sufficient measure for evaluating codebook reliability and generalizability
Reference graph
Works this paper leans on
- [1] Balyan, R., Arner, T., Taylor, K., Shin, J., Banawan, M., Leite, W.L., McNamara, D.S.: Modeling one-on-one online tutoring discourse using an accountable talk framework. International Educational Data Mining Society (2022)
- [2] Barany, A., Nasiar, N., Porter, C., Zambrano, A.F., Andres, A.L., Bright, D., Shah, M., Liu, X., Gao, S., Zhang, J., et al.: ChatGPT for education research: Exploring the potential of large language models for qualitative codebook development. In: Intl. Conference on Artificial Intelligence in Education. pp. 134–149. Springer (2024)
- [3] Booth, B.M., Jacobs, J., Bush, J.B., Milne, B., Fischaber, T., D'Mello, S.K.: Human-tutor coaching technology (HTCT): Automated discourse analytics in a coached tutoring model. In: Proceedings of the 14th Learning Analytics and Knowledge Conference. pp. 725–735 (2024)
- [4] Castleberry, A., Nolen, A.: Thematic analysis of qualitative research data: Is it as easy as it sounds? Currents in Pharmacy Teaching & Learning 10(6), 807–815 (2018)
- [5] Gibbs, G.R.: Analyzing qualitative data (2018)
- [6] Herrenkohl, L.R., Cornelius, L.: Investigating elementary students' scientific and historical argumentation. Journal of the Learning Sciences 22(3), 413–461 (2013)
- [7] Hou, C., Zhu, G., Zheng, J., Zhang, L., Huang, X., Zhong, T., Li, S., Du, H., Ker, C.L.: Prompt-based and fine-tuned GPT models for context-dependent and context-independent deductive coding in social annotation. In: Proceedings of the 14th Learning Analytics and Knowledge Conference. pp. 518–528 (2024)
- [8] Liu, X., Wei, Z., Barany, A., Ocumpaugh, J., Baker, R.S., Zambrano, A.F., Zhou, Y., Giordano, C.: Exploring differences between hybrid GPT-human and human-created qualitative codebooks in an educational game. In: International Conference on Quantitative Ethnography. pp. 193–208. Springer (2025)
- [9] Michaels, S., O'Connor, C., Resnick, L.B.: Deliberative discourse idealized and realized: Accountable talk in the classroom and in civic life. Studies in Philosophy and Education 27(4), 283–297 (2008)
- [10] Michaels, S., O'Connor, M.C., Hall, M.W., Resnick, L.B.: Accountable Talk® Sourcebook. Pittsburgh, PA: Institute for Learning, University of Pittsburgh (2010)
- [11] Modi, A., Veerubhotla, A.S., Rysbek, A., Huber, A., Wiltshire, B., Veprek, B., Gillick, D., Kasenberg, D., Ahmed, D., Jurenka, I., et al.: LearnLM: Improving Gemini for learning. CoRR (2024)
- [12] Nguyen, H., Nguyen, V., Ludovise, S., Santagata, R.: Misrepresentation or inclusion: Promises of generative artificial intelligence in climate change education. Learning, Media and Technology 50(3), 393–409 (2025)
- [13] O'Connor, C., Michaels, S.: Supporting teachers in taking up productive talk moves: The long road to professional learning at scale. International Journal of Educational Research 97, 166–175 (2019)
- [14] Suresh, A., Jacobs, J., Lai, V., Tan, C., Ward, W., Martin, J.H., Sumner, T.: Using transformers to provide teachers with personalized feedback on their classroom discourse: The TalkMoves application. arXiv preprint arXiv:2105.07949 (2021)
- [15] Webb, N.M., Franke, M.L., Ing, M., Turrou, A.C., Johnson, N.C., Zimmerman, J.: Teacher practices that promote productive dialogue and learning in mathematics classrooms. International Journal of Educational Research 97, 176–186 (2019)
- [16] Weston, C., Gandell, T., Beauchamp, J., McAlpine, L., Wiseman, C., Beauchamp, C.: Analyzing interview data: The development and evolution of a coding system. Qualitative Sociology 24(3), 381–400 (2001)
- [17] Zambrano, A.F., Wei, Z., Zhang, J., Baker, R.S., Ocumpaugh, J., Barany, A., Liu, X., Zhou, Y., Paquette, L., Ginger, J., et al.: Data plus theory equals codebook: Leveraging LLMs for human-AI codebook development. Journal of Educational Data Mining 18(1), 25–65 (2026)