pith. machine review for the scientific record.

arxiv: 2604.13380 · v1 · submitted 2026-04-15 · 💻 cs.HC

Recognition: unknown

Does the TalkMoves Codebook Generalize to One-on-One Tutoring and Multimodal Interaction?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:27 UTC · model grok-4.3

classification 💻 cs.HC
keywords TalkMoves · Accountable Talk · tutoring discourse · multimodal data · codebook evaluation · inter-rater reliability · educational technology

The pith

TalkMoves codebook shows uneven generalizability to one-on-one tutoring and multimodal interactions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The TalkMoves codebook comes from Accountable Talk theory for classroom discussions. Researchers wondered if it still works well when teachers tutor one student at a time using chat, audio, or full video. They had two expert teachers label six sessions with both TalkMoves and a new codebook made partly by AI. TalkMoves produced more consistent labels between the two teachers, but the AI version captured a wider range of useful tutoring actions and felt easier to apply no matter the format. Neither codebook fully handled the nonverbal parts of tutoring or all the moves that matter in private sessions.

Core claim

This study finds that the human-developed TalkMoves codebook achieves higher inter-rater reliability (κ = 0.74) than the AI-human codebook (κ = 0.64) when annotating one-on-one tutoring sessions, yet the AI-human codebook offers broader empirical coverage and higher perceived usability across chat, audio, and multimodal modalities. Both codebooks fail to capture all tutoring-relevant moves and create ambiguity for actions expressed nonverbally or through multiple channels.
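For orientation, the reported statistic is Cohen's κ, which discounts raw agreement by the agreement expected from the two annotators' label frequencies alone. A minimal sketch with invented talk-move labels (not the paper's data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    assert len(a) == len(b) and a, "need equal-length, non-empty sequences"
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n        # raw agreement
    ca, cb = Counter(a), Counter(b)
    p_chance = sum(ca[l] * cb[l] for l in ca) / (n * n)  # agreement expected by chance
    if p_chance == 1.0:                                  # degenerate: a single label everywhere
        return 1.0
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical codes applied by two annotators to four utterances.
ann1 = ["revoice", "press", "press", "revoice"]
ann2 = ["revoice", "press", "revoice", "revoice"]
print(cohens_kappa(ann1, ann2))  # raw agreement 0.75, kappa 0.5
```

Values near 0.74 (as reported for TalkMoves) are conventionally read as substantial agreement, but the statistic says nothing about whether the codebook covers the moves that matter.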

What carries the argument

Direct comparison of the TalkMoves codebook and a hybrid AI-human codebook on six tutoring sessions across three modalities by two expert annotators, measuring reliability, coverage, and usability.

Load-bearing premise

The small number of six tutoring sessions annotated by only two experts is enough to draw conclusions about how well the codebook generalizes to all one-on-one tutoring and all interaction modalities.

What would settle it

Replicating the annotation process with a larger number of sessions from additional tutoring platforms or subject areas and checking if the patterns of undercapture and ambiguity remain the same.

Figures

Figures reproduced from arXiv: 2604.13380 by Corina Luca Focsan, Danielle R. Thomas, Jennifer St. John, Kenneth R. Koedinger, Kirk Vanacore, Marie Cynthia Abijuru Kamikazi, René F. Kizilcec, Tamisha Thompson.

Figure 1: Sample of multimodal dataset annotation interface, organized by speaker.
Figure 2: Average number of codes identified using the human-developed …
original abstract

Accountable Talk theory has been widely adopted to analyze classroom discourse and is increasingly used to annotate tutoring interactions. In particular, the TalkMoves codebook, grounded in Accountable Talk theory, is commonly used to label tutoring data and train models of effective instructional support. However, Accountable Talk was originally developed to characterize collaborative, whole-classroom oral discourse, not to identify talk moves in one-on-one tutoring environments using multimodal data (e.g., video, audio, chat). As tutoring platforms expand in scale and modality, questions remain about whether Accountable Talk-based codebooks generalize reliably beyond their original classroom context and data representation. This study examines whether the human-developed TalkMoves codebook generalizes in reliability, utility, and interpretability when applied to one-on-one tutoring across audio, chat, and multimodal data. We compare TalkMoves with a hybrid AI-human developed codebook using a workflow established in prior research. Two expert annotators with over 20 years of teaching experience applied both codebooks to six tutoring sessions spanning three modalities: chat-based, audio-only, and multimodal interactions. Results show that while TalkMoves achieved higher overall inter-rater reliability than the AI-human codebook (κ = 0.74 vs. 0.64), the AI-human codebook demonstrated broader empirical coverage and higher perceived usability across modalities. Both codebooks undercaptured tutoring-relevant moves and introduced ambiguity when identifying actions expressed through nonverbal and multimodal artifacts. Together, these findings highlight the uneven generalizability of TalkMoves to tutoring contexts and motivate the development of modality-aware, tutoring-grounded codebooks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript examines whether the TalkMoves codebook, grounded in Accountable Talk theory for classroom discourse, generalizes to one-on-one tutoring interactions across chat-based, audio-only, and multimodal modalities. The authors compare TalkMoves to a hybrid AI-human codebook by having two expert annotators (with over 20 years teaching experience) apply both to six tutoring sessions (two per modality), reporting higher inter-rater reliability for TalkMoves (κ=0.74 vs. 0.64) but broader empirical coverage and higher usability for the AI-human codebook; both undercapture tutoring-relevant moves and introduce ambiguity with nonverbal/multimodal artifacts.

Significance. If the core empirical observations hold, the work usefully documents concrete limitations of classroom-derived discourse codebooks when applied to tutoring, particularly in multimodal settings, and supplies specific kappa values plus qualitative usability notes from experienced educators to motivate modality-aware alternatives. The direct side-by-side comparison and identification of undercaptured moves constitute a clear contribution to HCI and learning-sciences annotation research.

major comments (1)
  1. [Methods (description of the six tutoring sessions and annotation procedure) and Results (κ values and qualitative observations).] The central claim of uneven generalizability (higher IRR but narrower coverage and lower usability for TalkMoves, plus shared struggles with multimodal artifacts) rests on annotations from only six tutoring sessions (two per modality) by two annotators. This sample size is load-bearing for the generalization inferences yet provides insufficient statistical power or diversity to separate systematic codebook limitations from session-specific idiosyncrasies or annotator bias; no additional statistical tests beyond the reported kappas are described to support the coverage and usability comparisons.
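The power concern can be made concrete: a nonparametric bootstrap over the annotated units yields an interval around a point κ, and with few units that interval is easily wider than the reported 0.74 vs. 0.64 gap. A hypothetical sketch with synthetic labels (not the study's annotations):

```python
import random
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_chance = sum(ca[l] * cb[l] for l in ca) / (n * n)
    return 1.0 if p_chance == 1.0 else (p_obs - p_chance) / (1 - p_chance)

def kappa_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for kappa, resampling annotated units."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    k = int(n_boot * alpha / 2)
    return stats[k], stats[-k - 1]

# Synthetic annotations: two coders agree on roughly 80% of 40 units.
codes = ["revoice", "press", "restate", "other"]
rng = random.Random(1)
ann1 = [rng.choice(codes) for _ in range(40)]
ann2 = [c if rng.random() < 0.8 else rng.choice(codes) for c in ann1]
point = cohens_kappa(ann1, ann2)
lo, hi = kappa_bootstrap_ci(ann1, ann2)
print(point, (lo, hi))  # compare the interval's width with the reported 0.10 kappa gap
```

The referee's point, in these terms: with six sessions the effective number of resampleable units is small, so intervals like `(lo, hi)` are wide and the two codebooks' κ values may not be distinguishable.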

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive review and the opportunity to clarify aspects of our work. We address the major comment below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: The central claim of uneven generalizability (higher IRR but narrower coverage and lower usability for TalkMoves, plus shared struggles with multimodal artifacts) rests on annotations from only six tutoring sessions (two per modality) by two annotators. This sample size is load-bearing for the generalization inferences yet provides insufficient statistical power or diversity to separate systematic codebook limitations from session-specific idiosyncrasies or annotator bias; no additional statistical tests beyond the reported kappas are described to support the coverage and usability comparisons.

    Authors: We agree that the small sample (six sessions total) limits the strength of generalization claims and that the study would benefit from larger-scale validation in future work. The design was intentionally focused to enable detailed qualitative comparison by two highly experienced annotators across modalities, yielding both the reported kappas and the usability/coverage observations from their notes and discussions. Coverage and usability were assessed qualitatively rather than through additional quantitative tests. In revision we will add an explicit limitations subsection that qualifies the preliminary nature of the inferences, discusses risks of session-specific effects and annotator bias, and outlines the need for expanded datasets. We will also provide further detail on the annotation workflow and how qualitative insights were derived. revision: partial

standing simulated objections not resolved
  • Expanding the dataset with additional sessions and annotators to increase statistical power and diversity is not feasible within the scope and timeline of the current revision.

Circularity Check

0 steps flagged

No circularity: purely empirical annotation study with no derivations or self-referential reductions

full rationale

This paper conducts a direct empirical comparison by having two expert annotators apply the TalkMoves codebook and an AI-human codebook to six tutoring sessions across modalities, then reports Cohen's kappa values, coverage observations, and usability judgments. No equations, fitted parameters, predictions, or first-principles derivations appear anywhere in the manuscript. The single reference to a 'workflow established in prior research' is a methodological citation for the annotation process and does not define or force the reported reliability, coverage, or generalizability conclusions. All claims rest on observable annotation data external to any self-referential loop, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical annotation study with no free parameters or invented entities; it relies on standard domain assumptions from qualitative research.

axioms (1)
  • domain assumption Cohen's kappa is an appropriate and sufficient measure for evaluating codebook reliability and generalizability
    Standard in annotation studies but assumes agreement equates to validity and that the chosen sessions represent the domain.
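The ledger's caveat that agreement does not equal validity can be shown directly: κ is prevalence-sensitive, so two annotator pairs with identical raw agreement can receive very different κ values when one pair's labels are skewed toward a single code. A toy example with invented labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_chance = sum(ca[l] * cb[l] for l in ca) / (n * n)
    return 1.0 if p_chance == 1.0 else (p_obs - p_chance) / (1 - p_chance)

# Both pairs agree on 8 of 10 units (raw agreement 0.8)...
balanced_a = ["press"] * 5 + ["other"] * 5
balanced_b = ["press"] * 4 + ["other", "press"] + ["other"] * 4
# ...but the second pair's labels are heavily skewed toward one code.
skewed_a = ["press"] * 9 + ["other"]
skewed_b = ["press"] * 8 + ["other", "press"]

print(cohens_kappa(balanced_a, balanced_b))  # ≈ 0.6
print(cohens_kappa(skewed_a, skewed_b))      # ≈ -0.11: below chance despite 80% agreement
```

This is why a κ of 0.74 on six sessions supports reliability of application but not, by itself, representativeness of the sessions or validity of the codes.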

pith-pipeline@v0.9.0 · 5633 in / 1224 out tokens · 50844 ms · 2026-05-10T13:27:01.437663+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 1 canonical work page

  1. Balyan, R., Arner, T., Taylor, K., Shin, J., Banawan, M., Leite, W.L., McNamara, D.S.: Modeling one-on-one online tutoring discourse using an accountable talk framework. International Educational Data Mining Society (2022)

  2. Barany, A., Nasiar, N., Porter, C., Zambrano, A.F., Andres, A.L., Bright, D., Shah, M., Liu, X., Gao, S., Zhang, J., et al.: ChatGPT for education research: Exploring the potential of large language models for qualitative codebook development. In: Intl. Conference on Artificial Intelligence in Education. pp. 134–149. Springer (2024)

  3. Booth, B.M., Jacobs, J., Bush, J.B., Milne, B., Fischaber, T., D'Mello, S.K.: Human-tutor coaching technology (HTCT): Automated discourse analytics in a coached tutoring model. In: Proceedings of the 14th Learning Analytics and Knowledge Conference. pp. 725–735 (2024)

  4. Castleberry, A., Nolen, A.: Thematic analysis of qualitative research data: Is it as easy as it sounds? Currents in Pharmacy Teaching & Learning 10(6), 807–815 (2018)

  5. Gibbs, G.R.: Analyzing qualitative data (2018)

  6. Herrenkohl, L.R., Cornelius, L.: Investigating elementary students' scientific and historical argumentation. Journal of the Learning Sciences 22(3), 413–461 (2013)

  7. Hou, C., Zhu, G., Zheng, J., Zhang, L., Huang, X., Zhong, T., Li, S., Du, H., Ker, C.L.: Prompt-based and fine-tuned GPT models for context-dependent and -independent deductive coding in social annotation. In: Proceedings of the 14th Learning Analytics and Knowledge Conference. pp. 518–528 (2024)

  8. Liu, X., Wei, Z., Barany, A., Ocumpaugh, J., Baker, R.S., Zambrano, A.F., Zhou, Y., Giordano, C.: Exploring differences between hybrid GPT-human and human-created qualitative codebooks in an educational game. In: International Conference on Quantitative Ethnography. pp. 193–208. Springer (2025)

  9. Michaels, S., O'Connor, C., Resnick, L.B.: Deliberative discourse idealized and realized: Accountable talk in the classroom and in civic life. Studies in Philosophy and Education 27(4), 283–297 (2008)

  10. Michaels, S., O'Connor, M.C., Hall, M.W., Resnick, L.B.: Accountable Talk® sourcebook. Pittsburgh, PA: Institute for Learning, University of Pittsburgh (2010)

  11. Modi, A., Veerubhotla, A.S., Rysbek, A., Huber, A., Wiltshire, B., Veprek, B., Gillick, D., Kasenberg, D., Ahmed, D., Jurenka, I., et al.: LearnLM: Improving Gemini for learning. CoRR (2024)

  12. Nguyen, H., Nguyen, V., Ludovise, S., Santagata, R.: Misrepresentation or inclusion: Promises of generative artificial intelligence in climate change education. Learning, Media and Technology 50(3), 393–409 (2025)

  13. O'Connor, C., Michaels, S.: Supporting teachers in taking up productive talk moves: The long road to professional learning at scale. International Journal of Educational Research 97, 166–175 (2019)

  14. Suresh, A., Jacobs, J., Lai, V., Tan, C., Ward, W., Martin, J.H., Sumner, T.: Using transformers to provide teachers with personalized feedback on their classroom discourse: The TalkMoves application. arXiv preprint arXiv:2105.07949 (2021)

  15. Webb, N.M., Franke, M.L., Ing, M., Turrou, A.C., Johnson, N.C., Zimmerman, J.: Teacher practices that promote productive dialogue and learning in mathematics classrooms. International Journal of Educational Research 97, 176–186 (2019)

  16. Weston, C., Gandell, T., Beauchamp, J., McAlpine, L., Wiseman, C., Beauchamp, C.: Analyzing interview data: The development and evolution of a coding system. Qualitative Sociology 24(3), 381–400 (2001)

  17. Zambrano, A.F., Wei, Z., Zhang, J., Baker, R.S., Ocumpaugh, J., Barany, A., Liu, X., Zhou, Y., Paquette, L., Ginger, J., et al.: Data plus theory equals codebook: Leveraging LLMs for human-AI codebook development. Journal of Educational Data Mining 18(1), 25–65 (2026)